A WARPED  FILTER  IMPLEMENTATION 
FOR  THE  LOUDNESS  ENHANCEMENT  OF  SPEECH 


By 

MARC  ANDRE  BOILLOT 


A DISSERTATION  PRESENTED  TO  THE  GRADUATE  SCHOOL 
OF  THE  UNIVERSITY  OF  FLORIDA  IN  PARTIAL  FULFILLMENT 
OF  THE  REQUIREMENTS  FOR  THE  DEGREE  OF 
DOCTOR  OF  PHILOSOPHY 

UNIVERSITY  OF  FLORIDA 


MAY  2002 


ACKNOWLEDGMENTS 


The completion of a dissertation is a considerable undertaking, and one which is tempered
by discipline and patience. The achievement is the degree of philosophy, the most
noble  of  academic  rewards.  In  the  course  of  this  journey  there  have  been  a few  individuals 
who  have  changed  not  only  the  way  I think  as  an  engineer  but  the  way  I think  as  a person. 
To  these  people,  I am  grateful.  I sincerely  thank  Dr.  John  Harris,  who  has  been  my  mentor, 
advisor,  and  friend.  He  has  compelled  me  to  think  with  such  enthusiasm  and  clarity,  that 
it  examines  the  origin  of  my  thoughts.  He  has  enlightened  me,  and  I will  have  the  rest 
of my life to favorably look back upon this experience. I thank Dr. Principe, whose ability
to capture thought through expression, and approach in conveying this understanding
through  engineering  and  math,  is  the  gift  of  a true  composer.  I thank  Dr.  Taylor,  who 
encouraged  me  to  pursue  these  advanced  studies,  and  I listened.  And,  to  Dr.  Bedenbaugh, 
my  admiration  and  respect  for  the  study  of  neuroscience  is  what  brought  me  to  graduate 
academics. 

This  research  commitment  would  not  be  possible  without  the  support  and  dedication 
of  Chin  Wong  and  Scott  Koenigsman,  my  managers  at  Motorola  who  have  given  me  an 
unprecedented  opportunity  to  pursue  this  research.  I would  like  to  express  my  sincere 
gratitude  to  V.P.  Jaime  Borras  and  Zaffer  Merchant  for  not  only  funding  this  research  but 
proposing  the  topic  that  has  become  this  dissertation.  Jaime’s  vision  that  if  you  can  dream 
it,  you  can  do  it,  is  the  genesis  of  this  commitment.  I gratefully  thank  these  individuals 
for  always  placing  in  me  their  confidence  and  trust.  They  have  allowed  me  to  develop 
professionally,  in  my  career  and  as  an  individual. 

I humbly  thank  my  father  whose  eccentric  philosophies  and  examinations  of  life  are  the 
inspiration  of  my  achievements.  As  well,  I cannot  thank  enough  my  mother,  sister,  fiancee, 
and  family  for  their  love  and  never  ending  support.  And,  to  all  my  colleagues,  it  has  been 
a wonderful  journey  in  intellect,  understanding,  and  friendship. 




TABLE  OF  CONTENTS 


page 


ACKNOWLEDGMENTS ii 

LIST  OF  TABLES vi 

LIST  OF  FIGURES viii 

ABSTRACT xii 

CHAPTER 

1 INTRODUCTION 1 

1.1  Background 2 

1.2  Speech  Enhancement 5 

1.3  Contributions  and  Chapter  Organization 7 

2 MODELS  OF  LOUDNESS  11 

2.1  Loudness 12 

2.1.1  Critical  Bands 14 

2.1.2  Auditory  Filters 16 

2.1.3  Excitation 18 

2.2  Measuring  Loudness  21 

2.2.1  Power  Law  of  Hearing  22 

2.2.2  Loudness  and  Bandwidth 25 

2.2.3  Outer  to  Middle  Ear  Filter 27 

2.3  Calculating  ISO-532B  Loudness 30 

2.3.1  Specific  Loudness 30 

2.3.2  Slope  Excitation 32 

2.3.3  Discussion 33 

2.4  Simplifying  the  Loudness  Model 34 

2.4.1  PLP  Technique 35 

2.4.2  Extending  PLP  for  Loudness 36 

2.4.3  The  Loudness  Approximation 38 

2.4.4  Model  Discussion 42 

3 VOWEL  POWER  46 

3.1  Vowels 47 

3.1.1  Synthetic  Model 48 

3.1.2  Masking  Effects 50 

3.1.3  TIMIT 54 

3.2  Identification  59 



3.2.1  Loudness  Adaption  and  Auditory  Fatigue 60 

3.2.2  Formant  Expansion 62 

3.2.3  Modulation  Depth 64 

3.2.4  Synthetic  Vowel  Loudness  65 

4 WARPED  LINEAR  PREDICTION 68 

4.1  Linear  Prediction  Model 69 

4.2  Bandwidth  Expansion 71 

4.3  Vocoders 73 

4.3.1  Perceptual  Noise  Weighting  74 

4.3.2  Adaptive  Post-filtering 75 

4.4  Warped  Filtering 78 

4.5  Warped  Filter  Structures 83 

4.5.1  Analysis  Filter 83 

4.5.2  Synthesis  Filter 88 

4.5.3  Direct  Form  Filter 89 

4.5.4  Warped  Bandwidth  Expansion 92 

4.5.5  Filter  Structure 94 

4.6  Auditory  Modelling 102 

4.7  The  Gamma  Filter 105 

5 OBJECTIVE  EVALUATIONS  112 

5.1  ISO-532B  Analysis 113 

5.1.1  The  Optimal  Warping  Factor  119 

5.1.2  Warped  Filter  Loudness  121 

5.1.3  Equating  Energy  to  Loudness 125 

5.1.4  Results 129 

5.2  Speech  Recognition 137 

5.2.1  Spectral  Distortion  Measures 137 

5.2.2  A Measure  of  Loudness  Distortion 139 

5.3  Recognition  Results 144 

5.3.1  DTW  Results 147 

5.3.2  HMM  Results 148 

6 SUBJECTIVE  EVALUATIONS 152 

6.1  Measures  of  Speech  Intelligibility 152 

6.2  Intelligibility  Test 155 

6.2.1  Procedure 156 

6.2.2  Intelligibility  Results 157 

6.3  Loudness  Test 158 

6.3.1  Procedure 158 

6.3.2  Sensitivity  Screening 160 

6.3.3  Loudness  Results 161 

6.4  Acceptability  Test 163 

6.4.1  Procedure 163 

6.4.2  Acceptability  Results 164 




7 CONCLUSIONS 166


APPENDICES 

A FILTER  COEFFICIENT  TRANSFORMATION 171 

B WARPED  PHASE 176 

C HMM  TRAINING 177 

REFERENCES 184 

BIOGRAPHICAL  SKETCH  194 




LIST  OF  TABLES 


Table  page 

3.1  TIMIT  TEST  phoneme  occurrences  (N),  power  (P%),  accessory  Loudness 

(aL%),  masked  Power  (mP%),  and  sone  loudness  approximation  error  (E%)  55 


3.2  Relative  occurrence  (N),  total  average  power  (P),  average  masked  Power 
(mP),  average  contribution  of  accessory  loudness  (aL),  and  approximation 
error  (E)  for  all  phoneme  categories  of  the  TIMIT  test  set 56 

5.1  Dialect  regions  and  number  of  speakers  in  each  region 120 

5.2  Phoneme  categories  and  the  loudness  gain  difference  between  the  warped  and 

linear  bandwidth  expansion  filters 123 


5.3  TIMIT TEST phoneme occurrences (N), power (P%), Linear (a = 0) loudness
increase (number of times louder) Ny/Nx, Warped (a = 0.5) loudness
increase Ny/Nx 124


5.4  Equal energy phoneme gains for linear expansion a = 0 with the true ISO-532B
for the 1,681 sentences of the TIMIT test set. The ratio Ny/Nx is the
loudness increase of the enhanced phoneme to the original (number of times
louder), dBgain is the gain from Eq(5.13) required to scale up the original
to achieve equal loudness, Ny/Ngx is the loudness increase of the enhanced
to the scaled original and E = |1 - Ny/Ngx| is the approximation error of
dBgain 131


5.5  Equal energy phoneme gains for linear expansion a = 0 with the warped
approximation of the ISO-532B for the 1,681 sentences of the TIMIT test
set with SPL levels between 50 and 80dB. The ratio Ny/Nx is the loudness
increase of the enhanced phoneme to the original (number of times louder),
dBgain is the gain from Eq(5.13) required to scale up the original to achieve
equal loudness, Ny/Ngx is the loudness increase of the enhanced to the
scaled original and E = |1 - Ny/Ngx| is the approximation error of dBgain 132


5.6  Equal energy phoneme gains for warped expansion a = 0.5 with the true ISO-532B
for the 1,681 sentences of the TIMIT test set with SPL levels between
50 and 80dB. The ratio Ny/Nx is the loudness increase of the enhanced
phoneme to the original (number of times louder), dBgain is the gain from
Eq(5.13) required to scale up the original to achieve equal loudness, Ny/Ngx
is the loudness increase of the enhanced to the scaled original and E =
|1 - Ny/Ngx| is the approximation error of dBgain 134




5.7  Equal energy phoneme gains for warped expansion a = 0.5 with the approximation
of the ISO-532B for the 1,681 sentences of the TIMIT test set with
SPL levels between 50 and 80dB. The ratio Ny/Nx is the loudness increase
of the enhanced phoneme to the original (number of times louder), dBgain
is the gain from Eq(5.13) required to scale up the original to achieve equal
loudness, Ny/Ngx is the loudness increase of the enhanced to the scaled
original and E = |1 - Ny/Ngx| is the approximation error of dBgain 135

5.8  Average  Loudness  increase,  equivalent  dB  gain,  and  approximation  error  for 

phoneme  categories  using  the  true  ISO-532B  for  a = 0 and  a = 0.5  from 
Table(5.6)  for  TIMIT  test  sentences  with  SPL  levels  between  50  and  80dB.  136 


5.9  Comparison  of  the  average  correlation  coefficient  p between  Objective  and 

Subjective  speech  quality  [21] 138 

5.10  DTW  results  for  original  and  warped  speech  templates:  Number  of  vocabulary 

words  correctly  recognized  for  speaker  901  in  Motorola  stars  database.  20 
words  vs  6 enumerated  conditions;  Train  in:  cond  1 rep  1 2,  Test  in:  cond 
1 2 3 4 5 6 rep  all 149 

5.11  HMM  Discrete  results  for  vocabulary  words  correctly  recognized  for  speaker 

901  in  Motorola  stars  database.  20  words  vs  6 conditions  for  original  and 
warped  speech  templates.  Train  in:  cond  1 2 3 4 5 6 rep  1 2 Test  in:  cond 
1 2 3 4 5 6 rep  all 150 

5.12  HMM  Continuous  results  for  vocabulary  words  correctly  recognized  for  speaker 

901  in  Motorola  stars  database.  20  words  vs  6 conditions  for  original  and 
warped speech templates. Train in: cond 1 2 3 4 5 6 rep 1 2 151

6.1  Vocabulary  of  words  used  for  Rhyming  Test  of  Intelligibility,  subdivided  into 

confusable  sets  I-III 156 

6.2  Average  intelligibility  results  of  the  rhyme  test  for  16  listeners  hearing  60 

words  with  0 dB  SNR.  Table  results  are  displayed  as  the  percent  correct 
Pn  population  mean  with  ±95%  Confidence  Level  157 

6.3  Vocabulary  of  words  used  for  Loudness  Test 158 

6.4  Loudness listening test for warped filter with a = 0.5: Total number of times
the processed word was selected over the original word for all 16 listeners 162


6.5  Sentence acceptability results for original sentences (A) and processed sentences
(B) with warped filter for a = 0.5. Twenty random sentences from
the TIMIT dataset were presented to each of 16 Listeners. The Quality
rating 1 (excellent) to 3 (fair) is their mean response for the 20 sentences,
and #L column is the number of times a sentence was selected as being
louder. It is given as a percentage in the last column 165


LIST  OF  FIGURES 

Figure  page 

2.1  Equal  loudness  curves 14 

2.2  Mapping  of  the  linear  frequency  scale  to  the  critical  band  scale  given  by 

Eqs(2.2)  and  (2.3) 16 

2.3  Roex auditory filters for input levels 50 to 90dB at center frequencies 100Hz,
1KHz, and 3KHz 18

2.4  Example  of  pure  tone  masking  threshold  generated  by  a narrow  band  masker.  19 

2.5  Example  of  noise  notch  method  to  trace  out  auditory  filter  shapes 20 

2.6  Generation of excitation function, a) individual auditory filter responses from
a 1KHz sinusoid input, and b) resulting excitation pattern 20

2.7  Excitation level versus critical band pattern for 1KHz tone. Threshold in

quiet  indicated  by  dashed  line 21 

2.8  Relation between loudness and bandwidth a) input narrowband noise centered
at 1KHz with bandwidths 40, 80, 160, 320, 640 and 1280Hz all at
constant 60dB SPL b) corresponding excitation patterns, and c) resulting
loudness pattern 26

2.9  Loudness  of  tones  separated  by  a critical  band 27 

2.10  Outer  to  middle  ear  filter  given  by  Eq(2.20)  for  various  values  of  i? 29 

2.11  16 weighting functions used to compute Θ(Ω(m)) 37

2.12  Linear  approximation  to  excitation  slopes  generated  by  roex  auditory  filters.  38 

2.13  Frequency  warping  using  Oppenheim  recursion  on  autocorrelation  sequence.  39 

2.14  Outer  to  middle  sensitivity  characteristics 40 

2.15  Determination  of  maximum  interim  excitations 41 

2.16  Absolute  threshold  of  hearing 42 

2.17  Loudness predictions of the ISO-532B (dotted) and the warped loudness
approximation (solid) 43

2.18  Loudness prediction of ISO-532B (dotted) and approximation (solid) 44

3.1  Average  formant  locations  for  vowels  in  American  English  (Peterson  and 

Barney,  1952) 46 



3.2  Average  formant  locations  and  bandwidths  for  vowels  in  American  English 

with  corresponding  dB  drop  of  formant  amplitude  from  60dB  reference  [21].  48 

3.3  Five  pole  formant  synthesis  of  10  American  English  vowel  spectra  {y-axis  in 

dB,  x-axis  is  O-^KHz) 49 

3.4  ISO-532B  vowel  loudness  patterns  with  accessory  loudness  due  to  masking 

in  shaded  regions 50 

3.5  a)  Tone  and  b)  narrowband  masking  thresholds 52 

3.6  Auditory  Masking  Threshold 54 

3.7  Percent of masked power in vowel regions (darkened) of TIMIT speech
sentence 57

3.8  Accessory loudness of TIMIT speech sentence (vowel regions darkened) 58

3.9  Formant bandwidth expansion on synthetic vowel /a/; a) LPC pole displacement
broadens bandwidth by reducing formant pole peaks, and b) elevation
of spectrum to restore energy 64

3.10  Perceived equal loudness time functions (sentence and noise) 66

3.11  The  increase  of  loudness  as  a function  of  vowel  bandwidth 67 

4.1  Pole  displacement  model  used  to  demonstrate  an  evaluation  off  the  unit  circle 

with  r > 1 results  in  a broadened  pole  response  (shaded  region) 71 

4.2  Relation of pole distance from jω-axis to pole bandwidth in Laplace space. 72

4.3  General  CELP  coder  block  diagram 74 

4.4  Perceptual noise weighting a) vocal tract 1/A(z), b) coding noise A(z/β)/A(z),
and c) excitation 75

4.5  General CELP decoder block diagram 76

4.6  Response of 1/A(z/β) for various values of β 77

4.7  Critical band frequency warping using Oppenheim recursion on autocorrelation
sequence 82

4.8  Analysis  filter  element 83 

4.9  Unit  delay  replacement  with  all-pass 84 

4.10  Frequency warping characteristics of the all-pass element described by Eq(4.23)
for different values of the warping factor -0.8 < a < 0.8 85

4.11  Frequency warping characteristics of all-pass (dotted from Eq(4.23)) compared
to critical band scale (solid from Eq(2.3)) 86

4.12  Direct  substitution  of  all-pass  elements  in  FIR 87 

4.13  Synthesis  filter  element 88 




4.14  Modified  synthesis  filter  element  after  coefficient  transformation 90 

4.15  Modified  analysis  filter 90 

4.16  WLPC  vocoder  cited  in  [130] 92 

4.17  Changing  implementation  order  of  WLPC  vocoder  for  use  as  a WIIR  filter.  94 

4.18  Formant bandwidth expansion filter with frequency scale set by locally recurrent
a parameter 95

4.19  Family  of  curves  showing  frequency  dependent  bandwidth  expansion  for  a 

particular  evaluation  radius  from  the  warped  filter 96 

4.20  Warped filter output gain curves (normalized for unity at a = 0) for a sinusoidal
chirp signal on an evaluation radius of 1.02 97

4.21  Family  of  gain  compensation  curves  given  by  Eq(4.37) 99 


4.22  Spectral envelope of a synthetic vowel for warped bandwidth expansion 1/A(z/β)
(solid) and original 1/A(z) (dotted). Demonstrates an evaluation off the
unit circle with a warping factor a = 0 results in a uniform bandwidth
change for all formants; a) time response, b) frequency response, and c)
spectral envelope. One slider is used to set the warping factor -0.6 < a <
0.6. Another is used to set the evaluation radius and a third slider allows
first order low-pass or high-pass filtering 1 ± μz^-1 to adjust for spectral
tilt. Loudness levels using the ISO-532B are given for the original and
processed vowel, and original formant bandwidths (dotted) are all 50Hz. 100

4.23  Spectral envelope of a synthetic vowel for warped bandwidth expansion 1/A(z/β)
(solid) and original 1/A(z) (dotted). Demonstrates an evaluation off the
unit circle with a warping factor a = 0.34 results in non-uniform bandwidth
change for all formants; a) time response, b) frequency response,
and c) spectral envelope. One slider is used to set the warping factor
-0.6 < a < 0.6. Another is used to set the evaluation radius and a third
slider allows first order low-pass or high-pass filtering 1 ± μz^-1 to adjust for
spectral tilt. Loudness levels using the ISO-532B are given for the original
and processed vowel, and original formant bandwidths (dotted) are all 50Hz. 101


4.24  WLPC  Gain  adjustment 103 

4.25  Model  of  the  synthetic  vowel  /a/  with  LPC  and  WLPC  envelope  on  linear 

frequency  scale  (top)  and  warped  scale  (bottom) 104 

4.26  Pole  radii  scaling  necessary  to  achieve  bandwidth  effects  of  Fig(4.19) 105 

4.27  Gamma  bases  given  by  Eq(4.45) 107 

4.28  Relation between the z and γ domains 109

4.29  Stable  higher  order  WIIR  filters 110 

4.30  Substitution of locally recurrent feedback loop with gamma kernel 111




4.31  Twenty center filter tap magnitude values for a) low-pass element of Fig.
4.29 and b) γ element of Fig. 4.30 111

5.1  Energy  redistribution  a)  time  plot  and  b)  corresponding  ISO-532B  loudness 

pattern 115 

5.2  SOLA  temporal  modification  a)  time  plot  and  b)  corresponding  ISO-532B 

loudness  pattern 117 

5.3  Formant  expansion  a)  time  plot  and  b)  corresponding  ISO-532B  loudness 

pattern 118 

5.4  A sweep  of  a to  determine  the  frequency  scale  for  the  optimal  gain  in  loudness. 

Shows  mean  a curves  for  dialect  regions  1 to  8,  and  the  corresponding 
variance  delimited  by  bars 121 

5.5  A sweep  of  the  warping  factor  a versus  the  average  change  in  loudness  for 

vowel  phonemes  of  each  sentence  in  all  8 dialect  regions  of  the  TIMIT  test 
dataset.  Sentence  numbers  are  along  the  axis  projecting  into  the  page.  . 122 

5.6  Real  part  of  as  a function  of  b,  the  Bark  scale,  for  values  of  fc 143 

5.7  Speech  recognition  test  GUI 146 

5.8  SR-8  documentation  results  of  recognition  performance 147 

6.1  Intelligibility  test  GUI.  A presentation  of  60  random  TI-46  utterances  is 

presented to the listener at 0dB SNR, of which 50% are processed by the
warped  filter 156 

6.2  Loudness  test  GUI.  A total  of  80  words  are  presented  to  each  listener  of  which 

85%  of  the  words  are  processed  by  the  warped  filter.  A random  scaling 
gradient  between  0 to  5 dB  in  increments  of  1 dB  is  applied  to  the  speech 
tokens  to  determine  the  perceptual  gain 159 

6.3  Solid  line  shows  the  effective  dB  gain  of  the  warped  filter  lies  just  above 

the 2dB crossover point. Graph provides a scaled comparison of the average
loudness ratings for the TI-46 words processed by the warped filter
presented  to  16  listeners.  The  dotted  line  corresponds  to  the  sensitivity 
screening,  which  shows  the  listeners'  hearing  resolution  is  well  separated 
at  2dB.  Bars  are  the  95%  confidence  intervals  of  Eq(6.2) 161 

6.4  Acceptability  test  GUI.  A presentation  of  20  sentence  pairs  are  presented  to 

the  listener  to  rate  quality  and  overall  loudness.  One  sentence  in  each  pair 
is  processed  by  the  warped  filter,  and  the  other  is  the  original 164 




Abstract  of  Dissertation  Presented  to  the  Graduate  School 
of  the  University  of  Florida  in  Partial  Fulfillment  of  the 
Requirements  for  the  Degree  of  Doctor  of  Philosophy 


A WARPED  FILTER  IMPLEMENTATION 
FOR THE LOUDNESS ENHANCEMENT OF SPEECH


By 

Marc  Andre  Boillot 
MAY  2002 

Chairman:  Dr.  John  G.  Harris 

Major  Department:  Electrical  and  Computer  Engineering 

Cellular phones and small hand-held audio devices have limited power budgets yet
include audio speakers with high current drain. For manufacturing purposes, and cost
savings, the audio speakers chosen usually balance product requirements against manufacturing
costs. Larger less expensive speakers are usually integrated into the product since smaller
high  quality  power  efficient  speakers  are  more  expensive.  Much  of  the  current  focus  in 
industry  technology  has  been  better  speaker  design,  or  more  efficient  power  amplifiers  to 
minimize  battery  drain  for  speaker  phone  operations.  No  energy  conservation  schemes 
directly  operate  on  the  speech  signal.  The  question  we  address  in  this  dissertation  is  how 
to  make  speech  sound  louder  without  increasing  the  signal  energy. 

We  propose  a real-time  warped  filter  which  exploits  the  psychoacoustic  nature  of  the 
auditory system to enhance the perception of loudness without adding energy. The frequency
resolution of sound in the human auditory system is on a non-linear scale called the
critical  band  scale.  The  critical  band  concept  in  auditory  theory  states  that  for  a constant 
energy  bandwidth  product,  loudness  increases  when  a critical  band  is  exceeded.  A warped 
filter is proposed and developed to elevate the perception of loudness by applying nonlinear
bandwidth expansion to the formant regions of vowels in accordance with the critical band
scale.  This  is  the  first  known  study  to  propose  an  algorithm  which  elevates  the  perception 
of  loudness  without  adding  energy.  It  is  also  the  first  known  study  to  define  a filter  which 
adjusts  formant  bandwidths  on  a critical  band  scale,  and  to  use  a warped  filter  for  speech 
enhancement.  The  underlying  technique  is  an  extension  of  the  linear  bandwidth  broadening 
technique  used  for  speech  modelling  in  speech  recognition,  perceptual  noise  weighting,  and 
vocoder  post-filter  designs.  It  is  a pole-displacement  model,  which  is  a computationally 
efficient technique, and is included in the linear transformation of the warped filter coefficients.
In a warped recursive filter, a coefficient or filter transformation is necessary to avoid
un-realizable  time  dependencies.  In  this  thesis  we  include  the  pole  displacement  model  in 
a warped  filter  implementation  for  formant  critical  bandwidth  expansion.  The  inclusion 
of  a warped  pole  displacement  model  for  nonlinear  bandwidth  expansion  in  the  filter  was 
motivated  from  the  critical  band  concept  of  hearing.  The  filter  implementation  has  been 
inspired  by  the  biological  representation  of  loudness  in  the  peripheral  auditory  system,  and 
subjective listening tests confirm that a noticeable improvement of up to 2dB is attainable.




CHAPTER  1 
INTRODUCTION 


There  is  a large  world  market  for  hand-held  wireless  communication  devices,  and  the 
consumer  demand  grows  every  day.  It  is  always  of  concern  to  design  these  systems  to 
operate  on  the  lowest  amount  of  power.  The  power  saving  can  extend  battery  life  or 
processing  capability.  Small  savings  in  power  can  translate  to  longer  battery  life,  which  for 
the  consumer  may  suggest  a better  product  which  lasts  longer.  Many  cell  phones  and  small 
consumer  audio  appliances  with  limited  power  configuration  have  audio  speaker  capabilities. 
Many  cell  phones  are  equipped  with  speakerphones  that  project  the  speech  to  the  listener 
instead of being directly coupled to the ear. For manufacturing purposes and cost savings,
speakers  are  usually  chosen  to  balance  product  requirements  with  manufacturing  costs. 
Larger  less  expensive  speakers  are  usually  integrated  into  the  product.  Smaller  high  quality 
speakers  are  typically  more  expensive  though  more  power  efficient.  Much  of  the  current 
focus  in  industry  technology  has  been  better  speaker  design  and  more  efficient  resourcing 
of  current  drain  from  the  peripheral  support  and  amplification  chips. 

No energy conservation schemes operate directly on the speech signal. Speech enhancement
systems have been traditionally designed for one of two purposes: to eliminate noise
or improve the intelligibility of speech. Noise suppression systems alleviate the effects of
contaminant noise on the speech signal. Speech processing systems, such as vocoders, focus
on one particular aspect of speech intelligibility, to recreate the speech signal as it was originally
recorded. The object of this study has been to develop a speech processing technique
which  exploits  the  psychoacoustic  nature  of  the  auditory  system  for  loudness  enhancement 
of  speech,  without  incurring  additional  signal  energy. 




In this dissertation we propose a real-time warped filter which exploits the psychoacoustic
nature of the auditory system to enhance the perception of loudness without adding
energy  or  degrading  intelligibility.  The  frequency  resolution  of  sound  in  the  human  auditory 
system  is  on  a non-linear  scale  called  the  critical  band  scale.  The  critical  band  concept  in 
auditory  theory  states  that  for  a constant  energy  bandwidth  product,  loudness  increases 
when  a critical  band  is  exceeded.  A warped  filter  is  proposed  and  designed  to  elevate  the 
perception  of  loudness  by  applying  nonlinear  bandwidth  expansion  to  the  formant  regions 
of  vowels  in  accordance  with  the  critical  band  scale.  Vowels  contain  the  highest  energy  in 
speech,  are  spectrally  smooth,  and  are  the  longest  in  duration.  We  will  see  that  hearing 
is  less  sensitive  to  bandwidth  changes  than  to  changes  in  formant  frequency,  and  vowel 
bandwidths  can  be  excessively  broadened  before  vowel  identification  deteriorates. 
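
As an illustration of how a single warping factor can bend the frequency axis toward a critical band like scale, the following minimal sketch evaluates the phase map of a first-order all-pass element. It assumes the common form D(z) = (z^-1 - a)/(1 - a z^-1); the sign convention, the function name warped_frequency, and the test frequencies are assumptions made for this example and are not taken from the dissertation.

```python
import math


def warped_frequency(omega, a):
    """Map a frequency omega (radians/sample, 0..pi) through the phase of
    the assumed all-pass element D(z) = (z^-1 - a) / (1 - a z^-1).

    a is the warping factor, with |a| < 1. This is a textbook form of the
    mapping, used here only as an illustrative assumption.
    """
    return omega + 2.0 * math.atan2(a * math.sin(omega),
                                    1.0 - a * math.cos(omega))


if __name__ == "__main__":
    # Compare the linear axis (a = 0) with a warped axis (a = 0.5).
    for k in range(5):
        omega = k * math.pi / 4.0
        print(round(omega, 4),
              round(warped_frequency(omega, 0.0), 4),
              round(warped_frequency(omega, 0.5), 4))
```

With a = 0 the map is the identity. A positive a stretches the low-frequency region and compresses the high-frequency region, which is the qualitative behaviour of the critical band scale; the end points 0 and pi are left in place.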

The  technique  was  motivated  by  the  linear  bandwidth  broadening  technique  used  for 
speech  modelling  in  speech  recognition,  and  also  in  the  perceptual  noise  weighting  filter  in 
vocoder  design  for  bit  error  allocation.  The  known  linear  technique  is  a pole-displacement 
model  which  applies  a power  series  scaling  to  the  LPC  coefficients  to  evaluate  the  z transform 
on  a circle  other  than  the  unit  circle.  It  is  a particular  implementation  of  the  chirp-z 
transform  with  the  circle  as  the  evaluation  contour.  In  this  research  the  expansion  technique 
is  implemented  in  a realizable  warped  filter  to  expand  the  pole  bandwidths  on  a critical 
band  scale.  Since  the  warping  transform  is  a bilinear  mapping,  the  displacement  model  can 
be applied to the warped z domain to achieve nonlinear formant bandwidth expansion.
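
The linear pole-displacement step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation developed in this dissertation: it assumes a prediction polynomial A(z) = 1 + a_1 z^-1 + ... + a_p z^-p, scales each a_k by r^-k for an evaluation radius r > 1, and uses the standard approximation B = -(fs/pi) ln(rho) for the bandwidth of a pole at radius rho. The helper names, the 8 kHz sampling rate, and the single 500 Hz resonance are hypothetical choices for the example.

```python
import math


def displace_poles(a, radius):
    """Evaluate A(z) on a circle of the given radius by scaling a[k]
    with radius**(-k). For radius > 1 every pole moves toward the origin."""
    return [ak * radius ** (-k) for k, ak in enumerate(a)]


def resonator(freq_hz, bw_hz, fs):
    """Second-order A(z) with one complex-conjugate pole pair
    (hypothetical helper used only to build a test polynomial)."""
    rho = math.exp(-math.pi * bw_hz / fs)      # pole radius for this bandwidth
    theta = 2.0 * math.pi * freq_hz / fs       # pole angle for this frequency
    return [1.0, -2.0 * rho * math.cos(theta), rho * rho]


def pole_bandwidth_hz(a, fs):
    """Approximate bandwidth of the pole pair of a second-order A(z)."""
    rho = math.sqrt(a[2])                      # a[2] is the squared pole radius
    return -fs / math.pi * math.log(rho)


fs = 8000.0                                    # assumed sampling rate
a = resonator(500.0, 50.0, fs)                 # one formant-like resonance
a_wide = displace_poles(a, 1.02)               # evaluation radius r = 1.02

if __name__ == "__main__":
    print(pole_bandwidth_hz(a, fs), pole_bandwidth_hz(a_wide, fs))
```

In this unwarped case every pole gains the same bandwidth, (fs/pi) ln(r) in Hz, regardless of its frequency. Applying the same scaling in the warped z domain is what makes the expansion depend on frequency.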

1.1  Background 

This thesis started from a simple question, "How to make speech sound louder
without increasing the speech energy", proposed by Jaime Borras, Vice President of Technology
for the iDEN Division of Motorola. It has been a very rewarding question, and
one  which  is  still  quite  elusive  and  not  completely  understood.  As  a good  painting  can  be 
painted  by  many  artists,  or  a piece  of  music  played  by  trained  musicians,  the  difficulty  of 
the  composition  lies  in  capturing  the  artist’s  intended  expression.  This  topic  has  opened 
up many doors in speech processing research, and is one such question that defines what
engineering is about. It is about modelling the physical senses through engineering for a
practical and beneficial purpose. Such a question as the one proposed requires an understanding
of psychoacoustics, psychophysics, and the nature of sound in the human auditory
system as well as the principles of electrical engineering. The engineering is possible only
after the model is understood, and the tools to alter the model are available and
robust.

The  model  is  the  human  auditory  system,  and  is  one  which  has  been  well  documented, 
investigated, and researched, though not fully understood. The tools are the signal processing
techniques used in speech enhancement and speech recognition. They have been
validated over years of research, through the time and effort of many mathematicians, engineers,
and scientists. Sometimes the search for an answer precedes the question. The
objective of this research is to provide an engineering technique which makes speech sound
louder without increasing speech energy. In this case, it is a matter of understanding the
physical  representation  of  loudness  in  the  human  auditory  system  and  knowing  what  tools 
are  available  before  an  answer  is  revealed. 

The  difficulty  of  the  question  also  lies  in  the  evolution  of  the  human  auditory  system. 
It  would  seem  practical  for  the  human  auditory  and  articulatory  systems  to  evolve  over 
time  as  efficiently  as  possible.  Has  nature  been  unable  to  capture  all  the  limitations  of  the 
human  auditory  system  we  are  trying  to  exploit?  It  would  make  sense,  from  a linguistic 
view,  that  speech  production  evolved  in  a manner  complementary  to  auditory  perception. 
Should  the  speech  production  system  expel  energy  to  represent  speech  that  to  some  degree 
is  inaudible?  Of  course,  as  we  will  see,  evolution  is  a valiant  adversary,  and  our  claim  to 
increase  loudness  without  increasing  energy  may  precipitate  a cost.  The  sacrifice  may  be 
intelligibility,  and  to  this  point,  we  presume  the  nature  of  the  human  auditory  system  has 
reached a balance between loudness and intelligibility. Intelligible speech at a volume too
low to hear, or unintelligible speech at a high volume does not contribute to effective speech
communication.  Speech  must  be  intelligible  and  at  a sufficient  level.  Fortunately,  we  will 
see  that  certain  regions  of  speech  are  less  susceptible  to  this  balance,  and  we  can  reach 
a compromise.  Certain  regions  can  be  manipulated  to  such  a degree  that  intelligibility  is 
not sacrificed. These regions correspond to the vowel regions of speech. They are highly
resonant, and we will show in the following chapters how they are exploited to elevate the
perception  of  loudness. 

In this thesis, we present a warped filter design as a new application of the vocoder post-filter.
The speech enhancement filter is used to increase the overall loudness perception of
vowels in clean speech, without increasing signal energy or degrading intelligibility. The motivation
is derived from a fundamental principle of loudness. For an equal energy bandwidth
product, loudness will increase when a critical band is exceeded [149]. The warped filter
design enhances vowel loudness by a non-linear bandwidth broadening process to widen
formant bandwidths. The technique was inspired by the linear bandwidth broadening technique
used for speech modelling in speech recognition applications and speech enhancement
for improving the quality of vocoded speech [89, 85]. The linear model is used in speech
recognition  to  model  noisy  speech,  since  noisy  speech  tends  to  broaden  speech  formants 
[125].  A bandwidth  broadened  representation  of  the  speech  will  be  a closer  match  to  the 
speech  when  captured  in  noisy  conditions.  It  is  also  used  as  a vocoder  post-filter  operation 
to  suppress  quantization  noise  and  enhance  the  general  quality  of  vocoded  speech  [15,  84]. 
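
The linear bandwidth broadening referred to above is commonly realized by scaling the k-th linear prediction coefficient by a factor γ^k, which moves every pole of the all-pole model toward the origin and so adds roughly the same bandwidth to every formant. The following is a minimal sketch of that idea only; the function names, the example coefficients, and the value of γ are assumptions made for illustration and are not taken from this work.

```python
import math


def expand_bandwidth(lpc, gamma):
    """Return the coefficients of A(z/gamma) given those of A(z).

    lpc   -- [1, a1, a2, ...], coefficients of A(z) = 1 + a1 z^-1 + a2 z^-2 + ...
    gamma -- expansion factor, 0 < gamma <= 1 (illustrative choice)
    """
    return [a * gamma ** k for k, a in enumerate(lpc)]


def added_bandwidth_hz(gamma, fs):
    """Approximate 3 dB bandwidth added to every resonance, in Hz.

    Uses the standard pole-radius relation B = -(fs / pi) * ln(r).
    """
    return -(fs / math.pi) * math.log(gamma)


# Hypothetical second-order resonator with a double pole at radius 0.9.
original = [1.0, -1.8, 0.81]
broadened = expand_bandwidth(original, gamma=0.9)
```

Because the added bandwidth depends only on γ and the sampling rate, every formant is widened by the same number of Hz regardless of where it lies in frequency. That uniformity is what distinguishes the linear technique from the non-linear, critical band expansion pursued here.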

We  should  note  that  the  formant  bandwidth  expansion  technique  of  the  warped  filter  re- 
lies on  the  premise  that  loudness  is  a function  of  critical  bandwidth.  The  filtering  technique 
is  derived  from  a psychoacoustic  model  based  on  the  critical  band  concept,  and  utilizes  only 
the  power  spectrum  information.  The  auditory  system  is  a very  complex  organization  of 
processes  which  ultimately  begins  at  the  peripheral  level.  The  higher  levels  of  interpretation 
and  cognition  are  not  completely  known,  and  thus  the  engineering  models  typically  reside 
at  the  lower  level.  The  motivation  of  the  warped  filter  materialized  at  the  peripheral  level 
to  successfully  exploit  the  perceptual  representation  of  loudness  in  regards  to  the  critical 
band  concept.  However,  the  temporal  order  as  well  as  frequency  content  and  separation, 
in  addition  to  other  factors,  play  a very  intricate  role  in  hearing.  This  includes  loudness 
which is the percept of intensity. Alterations of the frequency spectrum in exploiting loudness
will have an effect on the quality of the sound. Such changes in the relative level of
different parts of the acoustic spectrum will be described by listeners as a change in “sound
quality” [44]. This introduces the concept of profile analysis, in which the relative acuity
of discriminating a change in the spectral shape represents a different but complementary
process than the acuity of detecting a change in the absolute intensity level [44].

Loudness is inherently related to the ability to detect intensity change. Profile analysis
suggests the quality of sound is determined by a simultaneous and relative comparison of
different  parts  of  the  frequency  spectrum  rather  than  a successive  comparison  of  intensity. 
In  classical  auditory  theory,  intensity  discrimination  depends  in  some  way  on  a discrimina- 
tion of  a change  in  the  firing  rate  of  auditory  neurons  [50].  Profile  analysis  is  the  term  used 
to  describe  the  mechanism  of  detecting  these  changes  in  spectral  shape  [44].  Intensity  dis- 
crimination and  profile  analysis  have  been  proposed  as  two  separate  and  distinct  processes 
involved  in  hearing  a change  in  intensity  [44].  In  profile  analysis  it  is  the  abstraction  of 
the  constancy,  in  respect  to  the  frequency  spectrum,  that  establishes  the  noticeable  change 
of  a sound.  In  chapter  3 we  will  see  studies  which  demonstrate  the  temporal  modulation 
envelopes  of  the  frequency  spectrum  have  an  effect  on  the  loudness  and  quality  of  sounds. 
In  profile  analysis,  the  detection  process  similarly  involves  the  simultaneous  comparison 
of  the  relative  energy  levels  within  the  spectrum.  Profile  analysis  may  provide  additional 
insight  into  the  detection  strategies  of  the  auditory  system  in  understanding  loudness.  Per- 
haps, an  alternate  speech  processing  technique  which  incorporates  profile  analysis  to  exploit 
the  modulation  envelopes  can  be  developed  in  a manner  similar  to  the  formant  expansion 
method  which  exploits  the  critical  band  concept. 

1.2  Speech  Enhancement 

We  incorporate  formant  broadening  as  a speech  enhancement  method  for  increasing 
perceptual  loudness.  A warped  filter  is  proposed  and  developed  to  elevate  the  perception 
of  loudness  by  applying  nonlinear  bandwidth  expansion  to  the  formant  regions  of  vowels 
in  accordance  with  the  critical  band  scale.  This  is  the  first  known  filter  implementation  to 
increase  the  perception  of  loudness.  Warped  filters  have  primarily  been  used  for  auditory 
modelling [75, 48]. They require a lower order than a general FIR or IIR filter for auditory
modelling, since they are able to distribute their poles in accordance with the frequency
scale [130]. The warped filter applies nonlinear bandwidth expansion to the formant regions
of vowels in speech, and equally redistributes the energy to raise the dilated spectrum.
Thus, we can enhance loudness without increasing energy by bandwidth broadening vowel
formants  in  speech.  The  filter  is  termed  a warping  filter  because  we  alter  the  frequency 
scale  to  achieve  non-linear  expansion  [102].  Warped  filters  are  closely  related  to  Laguerre 
filters.  Time-dispersive  all-pass  elements  in  the  warped  filter  inject  time  dependence  which 
results  in  non-uniform  frequency  resolution.  These  warping  characteristics  allow  a spectral 
representation  of  the  speech  which  closely  approximates  the  frequency  selectivity  of  human 
hearing. 
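
The non-uniform frequency resolution described above can be illustrated with the phase response of a first-order all-pass section, which is the usual building block of a warped filter. The sketch below evaluates the resulting frequency mapping; the function name and the warping factor `lam` are assumptions for illustration, and the value 0.5 is arbitrary rather than the optimal factor established later in this work.

```python
import math


def warped_frequency(omega, lam):
    """Map a normalized frequency (rad/sample) through a first-order all-pass.

    lam is the warping factor, -1 < lam < 1 (hypothetical parameter name).
    The mapping is omega + 2 * atan(lam * sin(omega) / (1 - lam * cos(omega))).
    """
    return omega + 2.0 * math.atan2(lam * math.sin(omega),
                                    1.0 - lam * math.cos(omega))


# Compare uniform and warped positions of a few frequencies.
lam = 0.5
for omega in (0.0, math.pi / 4, math.pi / 2, math.pi):
    print(f"{omega:.3f} -> {warped_frequency(omega, lam):.3f}")
```

For a positive warping factor the low frequencies are stretched and the high frequencies are compressed, while the end points 0 and π stay fixed. This is the qualitative behaviour of the critical band scale.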

Speech  enhancement  methods  are  employed  as  techniques  which  attempt  to  emphasize 
the salient characteristics of speech or enhance the aesthetic quality. The techniques focus
on emphasizing certain regions of speech which have been experimentally determined
to  contain  important  acoustic  or  auditory  cues  essential  to  speech  understanding  and  dis- 
crimination. In  most  cases  no  well  defined  measure  exists  which  truly  maps  a subjective 
quality  to  an  objective  quantity.  In  many  cases  speech  enhancement  implies  some  form  of 
noise  suppression  where  the  defined  target  is  noise.  Noise  is  the  contaminant  signal,  and 
the  approach  is  to  develop  estimation  functions  which  maximize  a signal  to  noise  ratio. 
The  statistical  motivation  behind  spectral  subtraction  type  and  adaptive  filtering  methods 
is  to  optimize  an  error  criterion  in  a mean  squared  error  sense  [104,  27].  These  types  of 
speech  enhancement  methods  are  techniques  for  noise  suppression,  and  involve  some  aspect 
of  estimating  a noise  criterion  in  respect  to  the  original  signal.  Many  approaches  to  speech 
enhancement  have  been  employed,  most  for  the  purpose  of  noise  suppression  using  estimates 
of  noise  regions  [8,  17].  Methods  either  attempt  to  improve  the  quality  of  the  signal  during 
acquisition,  somehow  compensate  the  signal  prior  to  its  anticipated  environment  condition, 
adjust  the  signal  on  its  way  out  to  the  environment,  or  introduce  active  measures  to  remove 
external  signal  interference  [12,  47]. 

General  speech  processing  methods  use  some  form  of  the  short-time  spectral  envelope 
since  the  frames  are  typically  long  enough  to  resolve  linguistic  components  of  speech,  such 
as  phonemes,  yet  short  enough  to  preserve  the  spectral  resolution.  For  these  enhancement 
methods  the  evident  problem  is  noise  and  the  motivation  is  to  suppress  the  noise.  The 
SNR increase, however, is still difficult to quantify in regards to intelligibility. It does
not provide a true subjective measure of noise suppression as measured by the listener.
Certain  spectral  subtraction  techniques  have  been  extended  with  psychoacoustic  principles 
to  exploit  the  sensitivity  of  human  hearing  [141,  137].  These  perceptual  methods  have  been 
integrated  into  speech  enhancement  systems  to  alleviate  unnatural  distortions  produced  by 
the suppression process and rely on maximizing a Noise Masking Ratio (NMR) rather than an
SNR.


1.3  Contributions  and  Chapter  Organization 

The  major  accomplishments  and  contributions  of  this  research  are  presented  as  follows: 

1)  Investigated  the  representation  of  loudness  in  the  peripheral  auditory 
system. 

2)  Provided  an  efficient  approximation  to  the  ISO-532B  loudness  model. 

3)  First  known  to  propose  an  algorithm  to  elevate  the  perception  of  loudness 
without  adding  energy. 

4)  First  known  to  use  a warped  filter  for  speech  enhancement  and  establish 
an  optimal  warping  factor. 

5)  First  known  to  provide  an  implementation  of  a filter  to  adjust  formant 
bandwidths  on  a critical  band  frequency  scale. 

6)  Proposed  an  analytic  relation  between  energy  gain  and  loudness. 

7)  Objectively  evaluated  algorithm  performance  through  machine  speech 
recognition  testing. 

8)  Subjectively  verified  loudness,  intelligibility,  and  quality  results  through 
listening  tests. 

In  this  study  we  instantiate  the  auditory  percept  of  loudness  to  be  our  target  variable, 
much  as  noise  is  to  adaptive  filtering.  We  introduce  a speech  enhancement  method  with 
the  following  purpose:  To  increase  the  perceptual  loudness  of  clean  speech  without  adding 
energy  or  degrading  intelligibility.  The  loudness  enhancement  method  is  proposed  for  use 
with  a speaker-phone  configuration  as  the  last  processing  stage  before  amplification.  This 
also corresponds to a point at which noise suppression techniques have already been applied
and the speech is considered clean. We do not examine the contribution of noise since the
performance  depends  on  the  noise  suppression  methods.  We  incorporate  the  ISO-532B 
analysis  method  to  objectively  evaluate  the  improvement  in  speech  loudness.  The  ISO- 
532B  defines  a quantitative  procedure  for  calculating  the  loudness  of  steady-state  sounds, 
and  can  be  used  to  precisely  determine  the  specific  loudness  and  loudness  patterns  of  speech. 
Our  technique  focuses  on  vowels  which  are  high  energy,  resonant,  and  longer  in  duration 
than  consonants.  Vowels  are  also  less  susceptible  in  terms  of  perception  to  changes  in 
bandwidth  and  shape  in  comparison  to  consonants  [56].  We  also  propose  an  approximation 
to  this  standard  for  use  as  a control  parameter  in  the  bandwidth  expansion  process.  The 
resulting  loudness  patterns  graphically  demonstrate  an  improvement  in  the  vowel  regions 
of  speech  using  the  proposed  technique.  The  remainder  of  this  section  provides  the  chapter 
organization.  The  chapters  are  enumerated  in  order  of  their  contribution,  and  the  individual 
outlines  summarize  the  main  points  of  each  chapter’s  research. 

Chapter  2:  Models  of  loudness.  This  chapter  presents  a review  of  the  human  auditory 
system  and  the  ISO-532B  loudness  analysis  method.  We  review  the  critical  band  concept, 
the  role  of  auditory  filters,  how  the  excitation  function  is  created,  the  effects  of  partial 
masking,  and  the  role  of  loudness  sensation  in  human  hearing.  The  ISO-532B  is  reviewed 
and  presented  in  detail.  It  is  an  extremely  precise  graphical  method  for  calculating  loud- 
ness. In  this  chapter  we  also  present  an  efficient  approximation  of  the  ISO-532B  for  the 
loudness analysis. The approximation removes the complexities of the ISO-532B while
predicting loudness results consistent with the standard. The approximation introduces two
replacements  of  the  Perceptual  Linear  Prediction  (PLP)  auditory  model:  1)  critical  band 
frequency  warping,  and  2)  a level  and  frequency  dependent  linear  slope  excitation  function. 
The  frequency  warping  converts  the  spectrum  to  a critical  band  scale  consistent  with  loud- 
ness analysis  in  the  human  auditory  system.  The  slope  function  delineates  the  excitation 
slopes  of  the  auditory  filter  responses  described  by  Moore  and  Glasberg’s  model  of  loudness. 
The  approximation  was  inspired  by  the  psychophysics  of  hearing  in  the  PLP  method  and 
is  similarly  a convenient  method  of  calculating  speech  loudness.  Approximation  errors  for 
the  loudness  approximation  are  presented  in  chapter  3. 


Chapter  3:  Vowel  power.  In  chapter  2 we  observed  the  role  of  auditory  filters  and  exci- 
tation functions  in  masking.  We  learned  that  masking  renders  certain  spectral  components 
inaudible  and  contributes  to  accessory  loudness.  The  effects  of  psychoacoustic  masking 
have  been  favorably  used  for  music  compression  in  MP3  data  encoding.  A principal  ob- 
jective in  exploiting  loudness  is  knowing  how  and  where  the  speech  energy  is  distributed, 
and  if  psychoacoustic  masking  can  be  exploited  for  loudness  enhancement.  In  chapter  3,  we 
provide  the  average  levels  of  masked  power,  total  power,  and  accessory  loudness  recorded 
from analysis of the TIMIT test sentence database. We show that vowels contain approximately 80%
of  the  speech  power,  are  relatively  long  in  duration,  have  smooth  spectral  envelopes,  show 
little  masking,  and  provide  the  overall  perceived  loudness  level  of  speech.  For  these  rea- 
sons, we  target  the  vowel  regions  of  speech  as  candidates  for  loudness  enhancement.  We 
also  provide  studies  of  loudness  adaptation,  recruitment,  recalibration,  and  enhancement  as 
related  to  auditory  fatigue.  In  addition,  research  studies  are  provided  to  present  the  effects 
of  manipulating  consonant- vowel  ratios  and  modulation  depths  on  intelligibility. 

Chapter  4:  A warped  bandwidth  expansion  filter.  In  chapter  4 we  review  the  pole 
displacement model used in linear prediction as a technique for bandwidth expansion. Chapter 3
demonstrated that vowels have high energy and moderate masking, and that their loudness
increases as formant bandwidths increase. In Chapter 4 we propose a critical band formant
expansion filter as a new extension of the vocoder post-filter to increase the overall
loudness  perception  of  clean  speech.  The  filter  motivation  is  derived  from  an  understanding 
of  psychoacoustics  and  an  application  of  a fundamental  principle  of  loudness.  For  an  equal 
energy  bandwidth  product,  loudness  will  increase  when  a critical  band  is  exceeded.  The 
technique  applies  formant  bandwidth  expansion  to  the  vowel  regions  of  speech  and  equally 
redistributes  the  energy  to  raise  the  dilated  spectrum  in  vowel  regions  using  a new  warped 
filter  design. 

Chapter  5:  Objective  evaluations.  In  this  chapter  we  perform  ISO-532B  loudness 
analysis on TIMIT test sentences and words processed by the warped filter of chapter 4.
The loudness gain of the warped filter is a numeric ratio of the ISO-532B loudness patterns.
Results indicate the warped filter with critical band expansion provides the optimal loudness
increase. A gain function which relates energy to effective loudness is also proposed. The
gain function tells us the scaling we can apply to elevate the original signal to the same loudness
as the processed signal. Results verify the effective loudness gain with the true ISO-532B
analysis  and  loudness  approximation  presented  in  Chapter  2.  We  also  extend  this  gain 
function  to  an  all-pole  loudness  ratio  and  relate  the  gain  in  loudness  to  the  evaluation 
radius.  A loudness  distortion  measure  is  also  proposed  to  further  simplify  the  gain  function. 
In  closing,  speech  recognition  rating  tests  are  conducted  to  show  the  warped  filter  does  not 
degrade  speech  quality  in  terms  of  machine  recognition  performance. 

Chapter  6:  Subjective  evaluation.  In  this  chapter  we  define  listening  tests  to  subjec- 
tively evaluate  the  improvement  in  overall  loudness  and  the  effect  on  intelligibility.  The 
listening  tests  are  conducted  to  substantiate  the  loudness  gains  and  correlate  the  analytic 
gains  to  perceptual  loudness  gains.  Three  listening  tests  are  performed:  intelligibility,  loud- 
ness, and  acceptability.  The  intelligibility  tests  evaluate  the  discernibility  of  the  speech  at 
noise  degradation  levels  of  0 and  -10  dB.  The  loudness  tests  evaluate  the  effective  loudness 
gain  through  a series  of  scaling  and  comparison  procedures.  The  acceptability  tests  provide 
the  overall  impression  of  speech  quality  through  score  ratings. 

Chapter 7: Conclusions. This chapter summarizes the completion of this research and
presents future directions.


CHAPTER  2 

MODELS  OF  LOUDNESS 


In  this  chapter  we  introduce  a loudness  calculation  routine  to  evaluate  the  loudness  of 
speech  with  results  similar  to  the  loudness  analysis  of  the  ISO-532B.  The  ISO-532B  is  a 
graphical  procedure  for  calculating  the  loudness  of  a complex  sound  that  has  been  analyzed 
in  terms  of  one-third  octave  bands.  We  provide  a loudness  approximation  suitable  for  use 
with  speech  systems  that  may  require  a numeric  representation  of  loudness  such  as  the  ISO- 
532B.  The  loudness  method  resembles  Zwicker’s  method  [148],  but  more  closely  follows 
Moore  and  Glasberg’s  model  [93]  of  the  excitation  and  final  loudness  on  a critical  band 
scale.  Moore’s  model  and  Zwicker’s  model  are  almost  identical  for  normal  to  moderate 
sound  levels.  Only  at  low  frequencies  and  levels  near  quiet  do  the  models  differ,  where 
Moore’s  model  alleviates  some  inconsistencies  of  Zwicker’s  model.  One  such  modification  is 
the  use  of  excitation  patterns  derived  from  auditory  filters  in  place  of  Zwicker’s  slope  table. 

The  loudness  approximation  model  incorporates  Moore’s  concept  of  auditory  filters  for 
excitation patterns but on a discrete basis. Moore’s model provides for the use of auditory
filters  to  generate  a continuous  excitation  pattern  for  loudness  calculation.  This  removes 
some  of  the  discontinuities  of  Zwicker’s  model  created  by  the  one-third  octave  band  de- 
compositions and  the  use  of  tabulated  excitation  slopes.  However,  to  reduce  the  model 
complexity  we  do  not  employ  auditory  filters,  but  model  their  response.  The  auditory  fil- 
ters generate  level  dependent  excitation  responses  which  can  be  characterized  as  a linear 
function  of  level  and  frequency.  In  place  of  these  filters  we  define  an  excitation  function 
representing  the  level  dependent  auditory  filter  responses.  This  simplifies  the  analysis  while 
still  providing  a close  approximation  to  Zwicker’s  model  at  normal  levels. 



We  examine  the  principles  of  the  Perceptual  Linear  Prediction  (PLP)  technique  [56]  as  a 
modelling  procedure  for  our  analysis.  The  PLP  technique  is  consistent  with  the  properties 
of  human  auditory  perception.  It  generates  a representation  of  the  speech  power  spectrum 
in  a manner  similar  to  the  processing  stages  of  the  auditory  system.  The  complete  PLP 
method  is  an  auto-regressive  modelling  technique  typically  used  in  speech  recognition.  Lin- 
ear prediction  is  performed  on  the  generated  perceptual  power  spectrum.  It  is  typically 
used  as  a feature  extractor  since  it  is  relatively  robust  to  changes  in  environmental  condi- 
tions and  to  speaker  variability.  Our  modification  of  the  PLP  consists  of  the  substitution 
of  certain  steps  to  more  properly  condition  the  auditory  spectrum  for  loudness  analysis. 
One such step is the replacement of the PLP filterbank with the power spectral warping
method  of  Warped  Linear  Prediction  (WLPC)  filter  design  [130].  This  effectively  stretches 
the  linear  frequency  spectrum  to  a continuous  critical  band  spectrum. 

Results  indicate  the  loudness  model  is  sufficiently  consistent  with  the  predictions  of  the 
ISO-532B.  We  use  the  ISO-532B  as  our  reference  since  it  precisely  calculates  the  loudness 
of  any  complex  spectrum.  The  ISO-532B  is  also  known  as  Zwicker’s  loudness  method. 
This  chapter  begins  with  a review  of  loudness  in  the  auditory  system,  and  the  motivation 
behind  the  procedures  of  the  ISO-532B  loudness  calculations.  We  also  review  the  concepts 
of  critical  bands,  auditory  filters,  masking,  and  excitation  patterns  since  they  are  intimately 
related  to  the  sensation  of  loudness.  Knowledge  of  these  psychophysical  processes  allows 
for  a complete  description  of  the  processing  activity  in  the  auditory  system.  This  is  a 
requirement for resolving the methodology of a loudness model. A loudness model attempts
to  capture  the  characteristics  of  the  auditory  system  responsible  for  loudness  sensation. 
By  doing  so,  it  measures  a quantity  in  direct  relation  to  the  percept  of  intensity  we  call 
loudness. 


2.1  Loudness 

Loudness  is  the  human  perception  of  intensity  and  is  a function  of  the  sound  intensity, 
frequency,  and  quality  [50].  Intensity  is  the  amount  of  energy  flowing  across  a unit  area 
surface in a second. It closely follows an inverse square law with distance. Its level in decibels is given by
Eq(2.1), where I is intensity and p is acoustic pressure.

$$L = 10 \log_{10} \frac{I}{I_0} = 20 \log_{10} \frac{p}{p_0} \tag{2.1}$$

The sound energy level can be represented with pressure since $I \propto p^2$. When the denominator
values are chosen as reference variables corresponding to the threshold of hearing, the
decibel pressure ratio becomes the sound pressure level (SPL) and the decibel intensity ratio
becomes the intensity level. Human sensations such as hearing increase logarithmically as
the intensity of the stimulus increases [128]. To measure loudness it is necessary to establish
a reference that relates the subjective sensation to the physical meaning. The loudness
level was created to characterize the loudness sensation of any sound, since magnitude estimations
do not provide an accurate representation. By definition, the loudness level of a
sound is the sound pressure level of a 1 kHz tone that is as loud as the sound under test.
The unit of measure is the “phon” and it is an objective value to relate the perception of
loudness to the SPL. Any sounds with equal phon levels are at equal loudness levels. The
continuous frequency spectrum can be assigned phon levels for a given SPL. The resulting
contours are known as the equal loudness curves [64] and are seen in Figure 2.1. The set
of curves for the SPL values from the threshold of hearing to the ceiling of hearing defines
a measure of equivalent loudness in phon at each frequency. The dotted line in Figure 2.1
represents the threshold of hearing where the limit of loudness sensation is reached. This
occurs at the 3 phon level since the threshold in quiet corresponds to 3 dB at 1 kHz [149].
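
The level definitions of Eq(2.1) can be sketched numerically as follows. The reference values used here (20 µPa and 1 pW/m²) are the conventional threshold-of-hearing references and are stated as an assumption, since the text does not give them explicitly; the function names are illustrative.

```python
import math

P_REF = 20e-6   # Pa, conventional reference pressure (assumed here)
I_REF = 1e-12   # W/m^2, conventional reference intensity (assumed here)


def sound_pressure_level(p):
    """Sound pressure level in dB for an RMS acoustic pressure p in Pa."""
    return 20.0 * math.log10(p / P_REF)


def intensity_level(i):
    """Intensity level in dB for an intensity i in W/m^2."""
    return 10.0 * math.log10(i / I_REF)
```

The factor of 20 for pressure and 10 for intensity follows directly from intensity being proportional to the square of pressure, so both expressions give the same level for the same sound.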

The phon, however, does not provide a measure for the scale of loudness. A loudness
scale provides a unit of measure stating how much louder one sound is perceived in comparison
to another. The phon level simply states the SPL required to achieve the
same  loudness  level.  It  does  not  establish  a metric,  or  unit,  of  loudness.  The  “sone”  was 
introduced  to  define  a subjective  measure  of  loudness  where  a sone  value  of  1 corresponds 
to the loudness of a 1 kHz tone at an intensity of 40 dB SPL for reference [128]. The sone
scale defines a scale of loudness such that a quadrupling of the sone level quadruples the
perceived loudness. An empirical relation between the sound pressure p and the loudness
S in sones is typically given by $S \propto p^{0.6} \propto I^{0.3}$. A ten-fold increase in intensity corresponds
to a 10 dB increase in SPL, which is a 10 phon increase for a 1 kHz tone. Since loudness is
approximately proportional to the cube root of the intensity, a 10 phon increase roughly
corresponds to a doubling of the sone value. The sound is perceived twice as loud.

Figure 2.1: Equal loudness curves.
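
The phon and sone relationship described above (1 sone at 40 phon, with the sone value doubling for every 10 phon increase) can be written as a pair of conversion functions. This is a sketch of that rule of thumb only. It is generally taken to hold at and above roughly 40 phon, and the function names are illustrative.

```python
import math


def phon_to_sone(phon):
    """Loudness in sones for a loudness level in phon (rule of thumb)."""
    return 2.0 ** ((phon - 40.0) / 10.0)


def sone_to_phon(sone):
    """Loudness level in phon for a loudness in sones (inverse of the above)."""
    return 40.0 + 10.0 * math.log2(sone)
```

For example, a sound at 60 phon maps to 4 sones under this rule, so it would be judged about four times as loud as the 40 phon reference.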


2.1.1  Critical  Bands 


The  most  dominant  concept  of  auditory  theory  is  the  critical  band  [31].  The  critical 
band  defines  the  processing  channels  of  the  auditory  system  on  an  absolute  scale  with 
our  representation  of  hearing.  The  critical  band  represents  a constant  physical  distance 
along  the  basilar  membrane  about  1.3mm  in  length  [149].  It  represents  the  signal  processes 
within  a single  auditory  nerve  cell  or  fiber.  Spectral  components  falling  together  in  a critical 
band  are  processed  together  [148].  The  critical  bands  are  independent  processing  channels. 
Collectively  they  constitute  the  auditory  representation  of  the  sound.  The  critical  band  has 
also  been  regarded  as  the  bandwidth  in  which  sudden  perceptual  changes  are  noticed  [50]. 
Zwicker  and  Terhardt  [150]  provided  the  following  expression  to  relate  critical  band  rate 
and  bandwidth  to  frequency  in  kHz. 


$$\mathrm{Bark} = 13 \tan^{-1}(0.76 f) + 3.5 \tan^{-1}\left[(f/7.5)^2\right] \tag{2.2}$$


However, this formula is not invertible in closed form, and an invertible procedure by
Traunmüller [134] is given in Eq(2.3), with f in Hz. Figure 2.2 shows the critical band scale
established by both equations.

$$z' = \frac{26.81 f}{1960 + f} - 0.53, \qquad
z = \begin{cases}
z' + 0.15(2.0 - z') & z' < 2.0 \\
z' & 2.0 \le z' \le 20.1 \\
z' + 0.22(z' - 20.1) & z' > 20.1
\end{cases} \tag{2.3}$$

Fletcher’s  original  experiments  on  masking  phenomena  revealed  the  characteristics  of  the 
critical  band  concept.  In  these  well-known  experiments  the  audibility  of  a pure  tone  is  eval- 
uated for  different  noise  bandwidths.  The  experimental  results  demonstrate  that  audibility 
is  affected  only  by  the  amount  of  noise  in  the  critical  band.  As  bandwidth  decreases  be- 
low a critical  bandwidth  the  detection  threshold  of  the  tone  decreases.  These  experiments 
suggested  the  existence  of  an  auditory  filter.  Since  noise  outside  a certain  bandwidth  does 
not  affect  detection  thresholds,  an  auditory  mechanism  which  suppresses  these  components 
seemed  likely.  The  auditory  filter  can  be  considered  a physiological  process  which  sup- 
presses components  outside  the  filter  region  but  does  not  adversely  affect  signals  within  the 
filter.  The  purpose  of  the  auditory  filter  is  to  isolate  signal  components  of  interest  and  to 
attenuate  the  signal  contributions  outside  this  region.  The  region  defined  by  this  boundary 
is  the  critical  bandwidth,  and  the  experimental  results  show  that  this  critical  bandwidth 
increases  with  increasing  frequency  [149]. 

The  critical  band  concept  is  crucial  for  describing  hearing  sensations,  especially  loudness. 
If  the  intensity  of  a sound  is  fixed,  the  loudness  of  sound  remains  constant  as  long  as  the 
bandwidth is less than a critical band [149]. Once the bandwidth is increased beyond a
critical  band,  loudness  will  increase.  When  the  bandwidth  exceeds  the  critical  bandwidth 
the  loudness  increases  even  though  the  energy  remains  constant.  This  is  based  on  the 
fact  that  our  hearing  system  analyzes  a broad  spectrum  into  parts  that  correspond  to 
critical  bands.  It  is  also  consistent  with  the  auditory  filter  concept  in  which  frequency  is 
continuously encoded along the basilar membrane and in which loudness is linearly related
to the area of excitation [11]. The critical band rate provides a measure of loudness over
a continuum of frequency channels. Since these auditory channels are essentially process
independent, their sum provides an overall evaluation of perceived loudness.

Figure 2.2: Mapping of the linear frequency scale (in Hz) to the critical band scale given by Eqs(2.2)
and (2.3).

By  assigning  each  critical  band  as  a discrete  unit  of  loudness,  it  is  possible  to  assess  the 
loudness  of  a spectrum  by  summing  the  individual  critical  band  units  [148].  The  sum  value 
represents  the  perceived  loudness  generated  by  the  sound  spectrum.  The  loudness  value  of 
each  critical  band  unit  is  a specific  loudness,  and  the  critical  band  units  are  referred  to  as 
Bark  units.  Thus,  1 Bark  interval  corresponds  to  a given  critical  band  integration  [149]. 
The  critical  band  scale  is  a frequency  to  place  transformation  of  the  basilar  membrane.  The 
principal observation of the critical band is that it can be interpreted as a rate scale, i.e.,
loudness  will  not  increase  until  a critical  bandwidth  is  exceeded. 
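
The effect of spreading a fixed amount of energy over more than one critical band can be illustrated with a deliberately simplified calculation. The sketch assumes that each critical band contributes a loudness proportional to its intensity raised to the power 0.3, and that the bands add without masking, spread of excitation, or threshold effects. These assumptions are for illustration only and are much cruder than the ISO-532B procedure.

```python
def total_loudness(intensity, n_bands, exponent=0.3):
    """Summed loudness when a fixed intensity is split evenly over n_bands.

    Each band is assumed to contribute (band intensity) ** exponent, and the
    per-band contributions are simply added (a simplification).
    """
    per_band = (intensity / n_bands) ** exponent
    return n_bands * per_band


# Same total intensity, spread over 1, 2, and 4 critical bands.
baseline = total_loudness(1.0, 1)
for n in (1, 2, 4):
    print(n, total_loudness(1.0, n) / baseline)
```

Under these assumptions the summed loudness grows as the number of bands raised to the power 0.7 while the total intensity stays constant, which is the behaviour the bandwidth expansion approach relies on.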

2.1.2  Auditory  Filters 

Subjective  listening  tests  and  experiments  traditionally  provide  a description  of  the 
auditory  filter  shapes  [106,  149,  92].  The  first  estimates  were  from  the  results  of  tone 
and  noise  masking  experiments  [31].  Fletcher  revealed  the  concept  of  the  critical  band 
and approximated the auditory filter that defined the boundary of a critical band as a
rectangular filter. The width of an auditory filter is generally described in terms of critical
bands  for  simplicity.  However,  they  are  not  really  rectangular  in  shape.  The  concept  of 
an  Equivalent  Rectangular  Bandwidth  (ERB)  is  useful  to  describe  the  critical  bandwidths 
[50].  The  ERB  is  a rectangular  filter  with  unit  height  and  bandwidth  that  contains  the 
same  amount  of  power  as  the  critical  band.  Eq(2.4)  provides  an  approximate  expression  of 
the ERB for Eq(2.2) [50], with f in kHz.

$$\mathrm{ERB} = 25 + 75\left[1 + 1.4 (f_{\mathrm{kHz}})^2\right]^{0.69} \tag{2.4}$$
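
Eq(2.4) is read here as 25 + 75[1 + 1.4 f²]^0.69 Hz with f in kHz, which is an assumption about the intended form. Under that reading, the bandwidth can be evaluated with the short sketch below; the function name is illustrative.

```python
def critical_bandwidth_hz(f_hz):
    """Bandwidth in Hz at centre frequency f_hz, using the assumed form of Eq(2.4)."""
    f_khz = f_hz / 1000.0
    return 25.0 + 75.0 * (1.0 + 1.4 * f_khz ** 2) ** 0.69


# Tabulate the bandwidth at a few centre frequencies.
for f in (100.0, 500.0, 1000.0, 4000.0):
    print(f, round(critical_bandwidth_hz(f), 1))
```

The expression tends to 100 Hz as the centre frequency approaches zero and grows steadily at higher frequencies, so a fixed bandwidth in Hz spans fewer critical bands as frequency rises.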


The critical bandwidth is roughly constant up to about 500Hz and then increases approximately
in proportion to center frequency, so the critical band rate scale is nearly linear in frequency
below 500Hz and nearly logarithmic above it. Experimental testing procedures shape
the interpretation of results. A preferred experimental procedure for determining auditory
filter  shapes  is  the  noise  notch  method  proposed  by  Patterson  [106].  It  favorably  constrains 
the  masking  effects  to  provide  a better  observation  of  the  auditory  filtering  process.  This 
method  restricts  the  auditory  filter  during  testing  to  within  a certain  bandwidth  as  given  by 
the  noise  notch.  It  provides  a way  to  trace  out  the  critical  band  filter  shape.  Patterson  and 
Nimmo  [107]  suggested  the  rounded  exponential  (roex)  function  in  Eq(2.5)  to  parameterize 
the  auditory  filter  shape  which  described  their  experimental  results. 


$$ |H(f)|^{2} = (1 + pg)\,e^{-pg} \tag{2.5} $$

where g is the normalized deviation of the evaluation frequency from the center frequency, fc,

$$ g = |(f - f_c)/f_c| \tag{2.6} $$


and p is a dimensionless parameter which describes the bandwidth and filter slopes. Moore
and Glasberg proposed the parameters p_l and p_u to model an asymmetrical filter shape at
different input levels as a better fit to the experimental data [94]. The auditory filters are
approximately symmetrical on a linear scale when the input level of the auditory filters is
L = 51dB/ERB.


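$$ \begin{aligned}
p(f_c) &= 4 f_c / (24.7 + 0.108 f_c) \\
p_u(f_c) &= p(f_c) \\
p_l(f_c) &= p(f_c) \left( 1 - 0.38\,\frac{L - 51\,\mathrm{dB}}{p(1\,\mathrm{KHz})} \right)
\end{aligned} \tag{2.7} $$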
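
The roex filter of Eq(2.5) to Eq(2.7) can be sketched as below. This is a hedged illustration, not the simulation code behind the figures: the factor 0.38 in the lower slope is an assumption taken from the published Glasberg and Moore level correction, and the function names are hypothetical.

```python
import math


def p_nominal(fc_hz):
    """Slope parameter p(fc) = 4 fc / (24.7 + 0.108 fc) from Eq(2.7)."""
    return 4.0 * fc_hz / (24.7 + 0.108 * fc_hz)


def p_lower(fc_hz, level_db):
    """Level dependent lower slope; equals p(fc) at 51dB/ERB (0.38 is assumed)."""
    return p_nominal(fc_hz) * (1.0 - 0.38 * (level_db - 51.0) / p_nominal(1000.0))


def roex_weight(f_hz, fc_hz, level_db=51.0):
    """Power weighting |H(f)|^2 = (1 + p g) exp(-p g) of Eq(2.5)."""
    g = abs(f_hz - fc_hz) / fc_hz
    # Below the center frequency use the level dependent slope, above it the fixed one.
    p = p_lower(fc_hz, level_db) if f_hz < fc_hz else p_nominal(fc_hz)
    return (1.0 + p * g) * math.exp(-p * g)


for level in (50, 70, 90):
    w_db = 10.0 * math.log10(roex_weight(800.0, 1000.0, level))
    print(f"{level}dB input: response of the 1KHz filter at 800Hz is {w_db:.1f} dB")
```

With these assumptions the low frequency skirt becomes shallower as the input level rises, while the filter is symmetric at 51dB/ERB.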

These modifications have been used to generate nonlinear models of the peripheral auditory
system [109], and for different representations of the ERB bandwidth leading to Lyon's
and Greenwood's model (cited in Slaney [124]). Moore and Glasberg concluded that the
critical variable determining auditory filter shape was the input level to the filter. They
also provided "corrections" to the outer to middle ear transfer function as a better fit to
experimental results. Figure 2.3 shows the simulated level-dependent roex auditory filter
responses for input levels 50 to 90dB at center frequencies of fc = 100Hz, 1KHz, and 3KHz.
The low frequency auditory filter slope decreases with level, and the high frequency slope
slightly increases with level.


Figure 2.3: Roex auditory filters for input levels 50 to 90dB at center frequencies 100Hz,
1KHz, and 3KHz.


2.1.3  Excitation 

Loudness  is  a function  of  the  excitation  pattern,  and  the  excitation  is  the  residual 
response of the auditory filters. The excitation pattern of a sound is a representation of the
activity or excitation evoked by that sound as a function of characteristic frequency [149].
The  excitation  pattern  is  used  in  all  models  of  loudness.  There  are  two  general  approaches 
to  determining  excitation  patterns.  The  first  method,  used  in  ISO-532B  [65]  and  proposed 
by  Zwicker,  calculates  the  spread  of  excitation  across  critical  bands  from  the  masking  of  pure 
tones  by  narrowband  noise  as  seen  in  Figure  2.4.  A narrowband  noise  at  a given  frequency 
is  the  masker  and  the  tone  to  detect  is  varied  in  frequency.  The  resulting  threshold  curve 
is  the  masking  pattern.  The  masking  effect  refers  to  the  phenomenon  that  certain  sounds 
become  inaudible  in  the  vicinity  of  louder  neighboring  sounds.  A partial  masking  effect 
reduces  the  audibility  but  does  not  completely  mask  the  sound.  The  masking  patterns 
describe  a masked  threshold  in  relation  to  the  test  tone’s  frequency.  Zwicker  and  colleagues 
suggested  that  the  resulting  masking  patterns  represented  the  evoked  neural  excitation 
[150].  The  ISO-532B  [66]  uses  masking  curve  slopes  from  this  method  in  a charting  routine 
to  calculate  the  spread  of  excitation. 


Figure  2.4:  Example  of  pure  tone  masking  threshold  generated  by  a narrow  band  masker. 

In the second method, proposed by Moore and Glasberg [96], excitation patterns are
generated from auditory filters. The auditory filter shapes determine the spread of excitation,
not the masking patterns. The masking patterns reflect the use of multiple auditory
filters, not a single auditory filter like the critical band. In Moore and Glasberg's method,
the auditory filter shape is determined by finding the just noticeable tone level in a notch
of noise. This is the notch noise method as seen in Figure 2.5, which also appears to be less
influenced by auditory events that contribute to the masking effects of Zwicker's method.
The notch noise method, which allows variation of the notch center, favorably restricts analysis
to a single auditory filter. Collectively, the auditory filter shapes are used to generate
the excitation pattern, which can be considered as the output of the auditory filters as a
function of their center frequency.




Figure  2.5:  Example  of  noise  notch  method  to  trace  out  auditory  filter  shapes. 


Figure 2.6, for example, shows the derived excitation pattern of a 1KHz sinusoid tone
from the simulated roex filters. The evoked excitation is generated by the contributing
outputs of the continuous auditory filterbank. The signal component falls within different
auditory filters, each of which responds according to its filter shape. Even though the auditory
filters at this level are symmetrical on a linear frequency scale, the resulting excitation
pattern is not. Auditory filter bandwidths increase with increasing frequency and are not
linearly spaced. These characteristics generate the asymmetrical excitation functions which
show a more pronounced upward spread of excitation [40].
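
The construction of an excitation pattern from a bank of auditory filters can be sketched as follows. This is a simplified illustration that assumes symmetric roex filters with p = 4fc/(24.7 + 0.108fc) and a handful of arbitrarily chosen center frequencies; the function names are hypothetical.

```python
import math


def roex_weight(f_hz, fc_hz):
    """Symmetric roex power weighting (1 + p g) exp(-p g) for a filter centered at fc_hz."""
    p = 4.0 * fc_hz / (24.7 + 0.108 * fc_hz)
    g = abs(f_hz - fc_hz) / fc_hz
    return (1.0 + p * g) * math.exp(-p * g)


def excitation_pattern_db(tone_hz, tone_db, centers_hz):
    """Output level of each auditory filter in response to a single tone."""
    return [tone_db + 10.0 * math.log10(roex_weight(tone_hz, fc)) for fc in centers_hz]


centers = [500.0, 800.0, 1000.0, 1250.0, 2000.0]
pattern = excitation_pattern_db(1000.0, 60.0, centers)
for fc, e_db in zip(centers, pattern):
    print(f"filter at {fc:6.0f} Hz: excitation {e_db:6.1f} dB")
```

Although each filter is symmetric, the filters above the tone are wider in Hz, so they pass more of the tone than the filters the same frequency ratio below it. The printed pattern therefore falls off more slowly toward high frequencies.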


Figure 2.6: Generation of excitation function, a) individual auditory filter responses from
a 1KHz sinusoid input, and b) resulting excitation pattern.




Experimental measurements of the auditory filter shapes using the noise notch method
reveal the variation of shape with level [55]. If the auditory filters were linear, their
shape would not change with the level of the input noise, but the measurements show that
it does. These observations led to the inclusion of the level dependent term for calculating
the auditory filter slopes in Eq(2.7), as shown in Figure 2.3. The excitation patterns for various dB
levels of a 1KHz input sinusoid on a critical band scale are shown in Figure 2.7. The
excitations are generated from the outputs of the roex auditory filters described by Eq(2.7)
and calculated in the same manner as the excitation function of Figure 2.6. It can be seen
that the excitation slopes of Figure 2.7 are approximately linear with respect to power level
on a critical band scale. The dashed line is the absolute threshold of hearing curve
described by Eq(2.22).




Figure 2.7: Excitation level versus critical band pattern for 1KHz tone. Threshold in quiet
indicated by dashed line.


2.2  Measuring  Loudness 

Sound level meters typically employ standard frequency weighting characteristics
to approximate loudness more closely than unweighted SPL. The standard A-weighting has a
frequency dependence that corresponds to that of the equal loudness contours at low levels.
It deemphasizes the extremes of the low and high frequency levels so as to compensate for
the reduced loudness perception in those regions. It only approximates loudness levels for
sinusoidal tones or narrow band noises at lower levels. The dB(A) values of noises, complex
tones, or combinations are perceptually misleading when used for the subjective valuation
of loudness [54]. At higher SPL the B-weighting and C-weighting curves provide a better
representation of subjective loudness, since they do not de-emphasize the lower and higher
extremes as severely as the A-weighted measure. However, these measures are based strictly
on frequency weighting and do not apply critical band rates as the auditory system does in
evaluating perceived loudness.

Third-octave band filters provide a sufficient approximation to critical band analysis and
are widely employed for acoustical measurements. Two procedures for calculating loudness
using third-octave bands are the DIN-45631 and ISO-532 standards. Since third-octaves
approximate the ear's frequency selectivity they can be employed similarly to critical band
integration as a measurement of loudness. The DIN-45631 provides a procedure for calculating
loudness based on human perception but is preferred for determining the loudness of
steady state sounds. For sounds with strong temporal variation such as speech, the ISO-532
measures are preferred. There are two standard methods proposed by the ISO-532 to
produce appropriate measures of loudness based on the human perception of sound. The A
method is based on the time function of total loudness, and provides a percentage of time
during which a given loudness is reached or exceeded. The B method is based on specific
loudness versus critical band rate.

2.2.1  Power  Law  of  Hearing 

The  following  section  describes  the  derivation  of  the  loudness  equation  used  in  the 
ISO-532B [149]. The total loudness, N, of a sound is produced by summing the specific
loudnesses, N', along the critical band rate scale. The specific loudness components are
incrementally  added  up  along  the  critical  band  scale,  similar  to  how  the  auditory  system 
integrates  loudness  over  frequency.  The  specific  loudness  is  a function  of  the  critical  band 
rate,  z,  and  is  termed  a loudness  distribution,  or  loudness  pattern.  The  loudness  pattern 
produces  a curve  under  which  the  area  of  the  summation  is  a direct  measure  of  perceived 
loudness. 


$$ N = \int_{0}^{24\,\mathrm{Bark}} N'\,dz \tag{2.8} $$
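
Numerically, the integral of Eq(2.8) reduces to a sum over a sampled loudness pattern. The sketch below assumes a uniformly sampled Bark axis and a hypothetical triangular loudness pattern chosen only because its area is easy to verify by hand; it is not data from this work.

```python
def total_loudness(specific_loudness, dz):
    """Trapezoidal approximation of N = integral of N'(z) dz over the Bark axis."""
    interior = sum(specific_loudness[1:-1])
    return dz * (0.5 * specific_loudness[0] + interior + 0.5 * specific_loudness[-1])


# Hypothetical triangular loudness pattern peaking at 8.5 Bark, sampled every 0.1 Bark.
dz = 0.1
z_axis = [i * dz for i in range(241)]  # 0 to 24 Bark
pattern = [max(0.0, 1.0 - abs(z - 8.5) / 2.0) for z in z_axis]
print(f"total loudness = {total_loudness(pattern, dz):.3f} sone")
```

The triangle has a base of 4 Bark and a peak of 1 sone/Bark, so the area under the pattern is 2 sone.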

The perception of intensity is called loudness. The sound intensity is defined as the
amount of sound energy flowing across a unit area surface in a second and, with the assumption
of a point source and no reflections, follows an inverse square law with distance.
Stevens' law states that sensations of intensity grow as a power law of physical intensity, and as
a result, a relative change in loudness may be assumed proportional to a relative change
in intensity [128]. Loudness listening test experiments have shown that equal ratios of intensities
lead to equal ratios of loudness estimates. Using specific loudness in place of total
loudness and excitation in place of intensity, the following relation holds:


$$ \frac{\Delta N'}{N'} = k\,\frac{\Delta E}{E} \tag{2.9} $$


where the excitation E is an intermediate value which describes the masking contribution
of the auditory filter slopes on a critical band rate scale. It provides a better approximation than
intensity to our frequency selective hearing. Eq(2.9) represents an equation of differences
which leads to the power law of hearing.

$$ \log N' = k \log E $$

$$ N' = E^{k} \tag{2.10} $$


For  low  values  of  N'  and  E the  internal  noise  floors  can  be  included. 


$$ N' + N_{gr} = (E + E_{gr})^{k} \tag{2.11} $$


Assuming the boundary condition that E = 0 leads to N' = 0, we must normalize by the
noise floors, respectively.


$$ \frac{N' + N_{gr}}{N_{gr}} = \left( \frac{E + E_{gr}}{E_{gr}} \right)^{k} \tag{2.12} $$

Solving for specific loudness, we get the equation

$$ N' = N_{gr} \left[ (1 + E/E_{gr})^{k} - 1 \right] \tag{2.13} $$


N_0 is necessary as a reference specific loudness to N_gr, and E_0 is the reference excitation
produced by a sound at 0dB SPL.

$$ \frac{N_{gr}}{N_0} = \left( \frac{E_{gr}}{E_0} \right)^{k} \tag{2.14} $$


The threshold factor, s, is included to express the internal excitation noise in terms of the
hearing threshold in quiet that it produces

$$ E_{gr} = E_{tq}/s \tag{2.15} $$


Inserting  these  substitutions  in  Eq(2.13)  provides  the  final  loudness  equation. 


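$$ N' = N_0 \left( \frac{E_{tq}}{s E_0} \right)^{k} \left[ \left( 1 + \frac{sE}{E_{tq}} \right)^{k} - 1 \right] \tag{2.16} $$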


For  moderate  to  high  levels  of  excitation  E the  influence  of  Etq  is  negligible  and  specific 
loudness  can  be  simplified  to 

$$ N' \approx N_0 \left( \frac{E}{E_0} \right)^{k} \tag{2.17} $$


Loudness is considered proportional to a power of the excitation intensity, as in Eq(2.10),

$$ N' = cE^{k} \tag{2.18} $$


where c is a constant. Zwicker and colleagues found k = 0.23 to provide the best fit to
observed results from experiments on pure tones masked by narrowband noise. For k = 0.3 the
compressive  nonlinearity  provides  a close  fit  to  tones,  and  for  k = 0.23  it  is  a close  fit  to 
noise  maskers  [149].  Equations  (2.11)  through  (2.16)  were  provided  to  better  match  the 
loudness  measurements  in  low  intensity  conditions  where  rapid  changes  in  loudness  occur. 
Eq(2.16)  is  a modification  of  the  general  power  law  of  Eq(2.18)  to  include  low  level  loudness 
calculations.  For  moderate  to  high  levels  of  E the  additional  terms  are  negligible.  At  low 
levels,  it  accounts  for  the  steep  drop  in  observed  loudness  near  threshold. 
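
The behaviour of Eq(2.16) relative to the simplified power law of Eq(2.17) can be checked numerically. The sketch below is illustrative only: it assumes N_0 = 0.064, k = 0.25 and s = 0.25 (the values quoted later for the ISO-532B), E_0 = 1, and a hypothetical threshold excitation of 3dB.

```python
def specific_loudness(e, e_tq, e_0=1.0, n_0=0.064, k=0.25, s=0.25):
    """Eq(2.16): specific loudness including the internal noise floor."""
    return n_0 * (e_tq / (s * e_0)) ** k * ((1.0 + s * e / e_tq) ** k - 1.0)


def specific_loudness_approx(e, e_0=1.0, n_0=0.064, k=0.25):
    """Eq(2.17): high level approximation that ignores the threshold term."""
    return n_0 * (e / e_0) ** k


E_TQ = 10.0 ** (3.0 / 10.0)  # hypothetical threshold excitation of 3dB re E_0
for level_db in (0, 20, 40, 60, 80, 100):
    e = 10.0 ** (level_db / 10.0)
    exact = specific_loudness(e, E_TQ)
    approx = specific_loudness_approx(e)
    print(f"{level_db:3d} dB: Eq(2.16) = {exact:8.4f}  Eq(2.17) = {approx:8.4f}  "
          f"ratio = {exact / approx:.3f}")
```

The ratio of the two expressions approaches one as the excitation grows, while near threshold Eq(2.16) falls well below the plain power law.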




Moore et al. [95] have modified the loudness equation of Eq(2.16) to more suitably
represent hearing selectivity at levels near quiet,

$$ N' = c \left[ \left( \frac{E}{E_0} \right)^{k} - \left( \frac{E_{tq}}{E_0} \right)^{k} \right] \quad \text{for } E > E_{tq} \tag{2.19} $$


In this equation, loudness approaches zero as E approaches E_tq and becomes zero when
the excitation reaches threshold. There are two favorable consequences to this simple modification
of the loudness equation. First, the steep drop in observed loudness near threshold is
accounted for in the equation, meaning low levels near threshold are better modelled with
regard to experimental loudness measurements [95]. This allows for the rapid growth of
loudness in high threshold regions, such as the low frequency regions. Second, as the excitation
increases, the threshold term becomes almost negligible in the calculation.
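
A short numerical sketch of Eq(2.19) illustrates both consequences. The constants c = 0.064 and k = 0.25 and the two threshold values are placeholders chosen for illustration, not the values fitted by Moore et al.

```python
def specific_loudness_moore(e, e_tq, e_0=1.0, c=0.064, k=0.25):
    """Eq(2.19): zero at and below threshold, power law growth above it."""
    if e <= e_tq:
        return 0.0
    return c * ((e / e_0) ** k - (e_tq / e_0) ** k)


for thr_db in (3.0, 30.0):  # hypothetical low and high thresholds in dB re E_0
    e_tq = 10.0 ** (thr_db / 10.0)
    row = []
    for above_db in (0.0, 5.0, 10.0, 40.0):
        e = 10.0 ** ((thr_db + above_db) / 10.0)
        row.append(f"{specific_loudness_moore(e, e_tq):.4f}")
    print(f"threshold {thr_db:4.1f} dB:", ", ".join(row))
```

For the same number of dB above threshold, the specific loudness grows faster where the threshold is higher, which matches the rapid loudness growth described for the low frequency regions.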


2.2.2  Loudness  and  Bandwidth 

Moore and Glasberg's model of loudness makes the following changes to Zwicker's
model: 1) a re-examination of the low frequency attenuations in the outer to middle ear
filter, 2) the evaluation of excitations based on analytic expressions of asymmetric level
dependent auditory filters, and 3) an account of the loudness growth near quiet through the
proposed relation of specific loudness to excitation in Eq(2.19). Moore and Glasberg's
revision of Zwicker's loudness model was introduced to better account for the way that
equal loudness contours change with level. Their model also provides a good explanation
as to why the loudness of a sound of fixed intensity remains constant when the sound has
a bandwidth less than the critical bandwidth.

Zwicker’s  experimental  results  concluded  that  loudness  was  independent  of  bandwidth 
for  bandwidths  less  than  the  critical  bandwidth.  And,  when  the  bandwidth  exceeds  a critical 
band,  loudness  increases.  Zwicker’s  model  of  loudness  assumes  excitation  patterns  for  all 
sounds  within  a critical  band  are  the  same  [95].  The  excitation  patterns  were  obtained 
from  masking  patterns  of  pure  tones  masked  by  narrowband  noises.  Moore  and  Glasberg’s 
model  derives  excitation  patterns  from  auditory  filter  responses  whose  shapes  were  derived 
from data obtained by noise notch experiments. Their description of the excitation pattern
through auditory filter analysis provides an alternate view: loudness remains constant below
a critical  bandwidth  not  because  the  excitations  are  identical,  but  because  the  total  specific 
loudness  due  to  excitation  is  constant.  When  the  bandwidth  exceeds  a critical  band,  the 
contribution  of  the  specific  loudness  due  to  broadening  of  the  excitation  increases.  The  area 
increase  from  the  broadening  of  the  excitation  is  greater  than  the  area  decrease  in  effective 
amplitude.  The  contribution  of  the  specific  loudnesses  is  thus  greater  as  compared  to  when 
the  bandwidth  was  less  than  the  critical  band. 



Figure 2.8: Relation between loudness and bandwidth a) input narrowband noise centered
at 1KHz with bandwidths 40, 80, 160, 320, 640 and 1280Hz all at constant 60dB SPL b)
corresponding excitation patterns, and c) resulting loudness pattern.


For illustration, we replicate their simulation results [95] using the auditory filters of
Eq(2.7). Figure 2.8 shows the excitation and loudness patterns of narrowband noise centered
at 1KHz with bandwidths of 40, 80, 160, 320, 640 and 1280Hz all at a constant overall level
of 60dB SPL. As can be seen from the figure, for bandwidths up to 160Hz the
decrease in specific loudness area below the peak is about the same as the slight increase
along the skirts. In this range the total area, or loudness, is relatively constant. For
bandwidths above 160Hz (the critical bandwidth at 1KHz), the increase in specific
loudness area along the skirts due to the excitation broadening is greater than the decrease
in area below the peak. In this case the loudness increases. Moore and Glasberg's model
provides predictions of loudness close to empirically obtained results, and more accurate
than those of Zwicker's model [95]. Their model places an emphasis on the frequency
selectivity of the hearing system, and has shown success at predicting the variation of
loudness with respect to intensity, frequency, and bandwidth.


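[Figure 2.9, panel a, "Loudness Adds": two 80dB tones separated by more than a critical band, plotted as excitation versus frequency. Annotations: I = 10^(80/10) (intensity), E = 10 log10 I, loudness = 2(cE^0.3) = c 37.6.]

[Figure 2.9, panel b, "Loudness is sum power": the two tones within one critical band, labelled 80dB, plotted as excitation versus frequency. Annotations: I = 10^(80/10) (intensity), E = 10 log10 2I, loudness = cE^0.3 = c 18.8.]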

Figure  2.9:  Loudness  of  tones  separated  by  a critical  band. 

Critical bands act as independent processing channels [50]. As a result, loudness is
dependent not only on signal level and bandwidth, but also on frequency. A simple example
serves to show the effect of critical band separation on perceived loudness. Figure 2.9
demonstrates the loudness of two tones of equal energy when they are a) separated by more than
a critical band, and b) within the same critical band. The compressive nonlinearity
described by the power law of hearing reveals that two tones separated by a
critical band will be louder than the same two tones within a critical band. Interestingly, the
loudness of the two tones is double when separated by a critical band.
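
The effect of the compressive nonlinearity on two equal tones can be sketched numerically. The first computation assumes the intensity form of the power law in Eq(2.18) with k = 0.3 and c = 1. The second applies the exponent to the level in dB, which is how the annotations of Figure 2.9 appear to be written; that reading is an assumption, and the names below are illustrative.

```python
import math

K = 0.3  # compressive exponent assumed for tones


def band_loudness(intensity, c=1.0, k=K):
    """Power law of Eq(2.18) for the total intensity inside one critical band."""
    return c * intensity ** k


tone = 10.0 ** (80.0 / 10.0)           # linear intensity of one 80dB tone
separated = 2.0 * band_loudness(tone)  # one tone per band: band loudnesses add
same_band = band_loudness(2.0 * tone)  # same band: intensities add, then compress
print(f"intensity-domain ratio = {separated / same_band:.3f}")

# Variant that applies the exponent to the level in dB instead of the intensity.
level_one = 10.0 * math.log10(tone)
level_two = 10.0 * math.log10(2.0 * tone)
ratio_db = 2.0 * level_one ** K / level_two ** K
print(f"dB-domain ratio = {ratio_db:.3f}")
```

In the intensity form the separated tones are louder by a factor of 2^(1-k), about 1.6 for k = 0.3. When the exponent is applied to the dB level the ratio is close to 2, which is consistent with the doubling stated above.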


2.2.3  Outer  to  Middle  Ear  Filter 

The frequency selectivity of the outer to middle ear is intimately related to the perception
of loudness. The first stage of a loudness model is to include the transfer characteristics
of the outer to middle ear. The outer ear transmission includes the form of the head, the
outer ear, and the outer canal which provides our high frequency sensitivity. The middle
ear begins with the ear drum and acts as a pressure receiver to convert sound intensities
to physical movements. The intensity of sound is a small air force oscillation over a large
displacement, and the required physical movements are large forces over small areas. The
physical movements are conveyed to the inner ear where physical motion is converted to
wave motions. This complete interaction defines an impedance matched transformation
which is extremely efficient in the human auditory system. This transmission is denoted
the outer to middle ear transfer function, and is normally introduced as a logarithmic attenuation
curve A_0. It represents the transmission characteristics the sound undergoes as
it travels from the free field to its internal representation.


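$$ H(z) = H_{lp}(z)\,H_{hp}(z) $$

$$ H_{hp}(z) = \frac{1 - 2z^{-1} + z^{-2}}{1 - 2Rz^{-1} + R^{2}z^{-2}} \tag{2.20} $$

$$ H_{lp}(z) = \frac{0.109\,(1 + z^{-1})}{1 - 2.5359z^{-1} + 3.9295z^{-2} - 4.7532z^{-3} + 4.7251z^{-4} - 3.5548z^{-5} + 2.1396z^{-6} - 0.9879z^{-7} + 0.2836z^{-8}} \tag{2.21} $$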


The outer to middle transfer function has been modelled from experimental listening
test results and measurements. Several authors have shown adjustments (re-examinations)
to the equal loudness contours published in ISO-226. A parameterized model of the outer to
middle transfer function has been proposed by Pflueger et al. [109] and is given in Eq(2.20)
for fs = 44.1KHz to account for these deviations with the parameter R. The responses
model a general set of attenuation curves A_0 between the inverted 100phon equal loudness
contour (topmost) and the inverted absolute threshold of hearing curve (bottommost). The
transmission is characterized by the cascade of a low pass filter and high pass filter. The
8th order IIR LPF determines the overall shape, and the high pass filter determines the low
frequency attenuation. The R factor sets the low frequency response below 1KHz. Figure
2.10 shows the filter at values of R=0.94 to 0.99 in increments of 0.01 for fs = 44.1KHz.
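
The magnitude response of this cascade can be evaluated directly on the unit circle. The sketch below assumes the high pass section has a double zero at z = 1 and a double pole at z = R, and uses the low pass coefficients as read from Eq(2.20) and Eq(2.21); the helper names are hypothetical and only the Python standard library is used.

```python
import cmath
import math

FS = 44100.0  # sampling rate in Hz for which the coefficients are stated
LP_NUM = [0.109, 0.109]
LP_DEN = [1.0, -2.5359, 3.9295, -4.7532, 4.7251, -3.5548, 2.1396, -0.9879, 0.2836]


def poly_z(coeffs, w):
    """Evaluate sum_n coeffs[n] * z^-n at z = exp(j w)."""
    return sum(c * cmath.exp(-1j * w * n) for n, c in enumerate(coeffs))


def outer_middle_ear_db(f_hz, r=0.97):
    """Magnitude in dB of H(z) = Hlp(z) Hhp(z) at f_hz (f_hz must be above 0)."""
    w = 2.0 * math.pi * f_hz / FS
    h_lp = poly_z(LP_NUM, w) / poly_z(LP_DEN, w)
    h_hp = poly_z([1.0, -2.0, 1.0], w) / poly_z([1.0, -2.0 * r, r * r], w)
    return 20.0 * math.log10(abs(h_lp * h_hp))


for r in (0.94, 0.97, 0.99):
    levels = ", ".join(f"{outer_middle_ear_db(f, r):7.1f}" for f in (100.0, 1000.0, 4000.0))
    print(f"R = {r:.2f}: response at 100Hz, 1KHz, 4KHz = {levels} dB")
```

Under these assumptions, values of R closer to one give less attenuation at low frequencies, which is the role the text assigns to the R factor.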

Figure 2.10: Outer to middle ear filter given by Eq(2.20) for various values of R.

Zwicker's model of loudness assumes an outer to middle ear transfer function which
is flat below 2KHz, and follows the form of the inverted absolute threshold curve above
2KHz. He assumed the low frequency thresholds below 2KHz were the complete result of
internal low-frequency noise, and therefore the attenuation should not reflect the elevated
threshold in this region. In Moore and Glasberg's model the assumed transmission function
from the outer to the middle ear is based on the inverted 100phon equal-loudness contour
for frequencies below 1KHz, and on the inverted absolute threshold curve for frequencies
above 1KHz. This is based on their assumption that the inner ear has an internal noise floor
which rises with level in accord with the outer to middle ear transmission. This allows the
internal noise floor to rise with level similarly to the inverted equal loudness levels.

Zwicker assumed no low frequency noise floor, and the low frequency threshold increase
was strictly due to increasing internal noise with level. Like Zwicker, Moore and Glasberg
also assume the inner ear is equally sensitive to frequencies above 1KHz. They propose
a filter shape in this region as the inverted absolute threshold curve. The 100phon and
absolute threshold curve on which the Minimum Audible Field (MAF) is based are also
approximately equivalent above 1KHz. The absolute threshold of hearing can also be approximated
by the following equation where f is expressed in KHz [55],

$$ A_{dB}(f) = 3.64 f^{-0.8} - 6.5\,e^{-0.6 (f - 3.3)^{2}} + 10^{-3} f^{4} \tag{2.22} $$
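
A minimal sketch of Eq(2.22) follows. It assumes the widely used three term approximation of the threshold in quiet, with f in KHz inside the formula; only the leading 3.64 f^-0.8 term is legible in the source, so the remaining coefficients are an assumption, and the function name is illustrative.

```python
import math


def threshold_in_quiet_db(f_hz):
    """Approximate absolute threshold of hearing in dB SPL; f_hz is in Hz."""
    f = f_hz / 1000.0  # the formula expects frequency in KHz
    return 3.64 * f ** -0.8 - 6.5 * math.exp(-0.6 * (f - 3.3) ** 2) + 1e-3 * f ** 4


for f_hz in (100.0, 500.0, 1000.0, 3300.0, 10000.0):
    print(f"{f_hz:7.0f} Hz: threshold = {threshold_in_quiet_db(f_hz):6.2f} dB SPL")
```

With these coefficients the threshold is elevated at low frequencies, dips below 0dB near 3.3KHz, and rises again toward high frequencies.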




2.3  Calculating  ISO-532B  Loudness 

There are two standard methods proposed by the ISO-532 to produce appropriate measures
of loudness based on the human perception of sound: the phon and sone level. The A
method  is  based  on  the  time  function  of  total  loudness,  and  provides  a percentage  of  time 
during  which  a given  loudness  is  reached  or  exceeded.  The  B method  is  based  on  specific 
loudness  versus  critical  band  rate.  The  ISO-532B  method  calculates  loudness  using  the 
same  temporal  characteristics  as  the  human  auditory  system.  It  is  also  known  as  Zwicker’s 
loudness,  and  it  is  the  only  ISO  standard  rated  for  loudness  calculations. 

The  ISO-532B  [65,  66]  calculates  loudness  in  three  steps:  1)  main  specific  loudness 
calculation,  2)  loudness  due  to  excitation  masking,  and  3)  specific  loudness  summation. 
The  ISO-532B  calculates  loudness  from  the  1/3  octave  band  analysis  of  the  sound  under 
examination.  At  the  time,  critical  band  filters  were  rarely  implemented,  especially  as  digital 
filters,  and  1/3  octave  band  analog  filters  were  readily  available.  Also,  1/3  octave  bands  were 
reasonably  close  to  critical  bands  except  for  low  frequencies,  and  the  bandwidth  differences 
between  the  scales  were  accounted  for  in  the  loudness  calculation.  These  constraints  resulted 
in  the  ISO-532B  being  a careful  graphical  procedure  that  ultimately  generates  a loudness 
pattern from the 1/3 octave band levels [148]. The area under this pattern corresponds to
the total loudness as stated in Eq(2.8).

2.3.1  Specific  Loudness 

The excitation level (L_E) is used in the calculation of the ISO-532B main specific loudness
N'. The excitation level is first calculated from each of the 1/3 octave bands. It is set
equal to the 1/3 octave band power spectrum under evaluation. In this first step, the lower
1/3 octave bands are also combined together, and their excitation levels are calculated in
their respective critical bands. The excitation level then includes Zwicker's attenuation
(A_0) to account for the outer to middle ear transform. The attenuation resembles the
inverted shape of the absolute threshold contour, except below 2KHz where it is assumed
zero. This provides the intensity relation between the measured free field and the internal
representation of the sound.

$$ L_E \Rightarrow L_E - A_0 $$

where => corresponds to an update of the excitation level for the attenuation factor. L_E also
accounts for the level difference (ΔL_D) between diffuse and free sound fields by including
an additional attenuation.

$$ L_E \Rightarrow L_E - \Delta L_D $$

The excitation level also balances the bandwidth differences by applying an attenuation
(ΔL_G) to the critical band level when necessary in the analysis.

$$ L_E \Rightarrow L_E - \Delta L_G $$

After the transmission and attenuation factors have been subtracted in the log magnitude
domain, the excitation level and excitation threshold (L_Es, see footnote 1) are used to determine the
main specific loudness from Eq(2.16). This is the loudness equation of the ISO-532B.


$$ N' = H_{sf} \left[ \left( 1 + 0.25 \cdot 10^{(L_E - L_{Es})/10} \right)^{0.25} - 1 \right] \tag{2.23} $$

$$ H_{sf} = 0.064 \cdot 10^{0.025\,L_{Es}} $$

$$ L_E = 10 \log (E/E_0) \ \mathrm{dB} $$

where E is the measured SPL in dB, and E_0 = 1 since normalization, as previously shown,
accounted for the excitation threshold. Basically, once the transmission attenuations have
been subtracted, the main specific loudness is calculated directly from Eq(2.16) using the
excitation thresholds and the 1/3 octave band spectrum levels L_G, where subscript G indicates
critical band rate.


1. L_Es denotes excitation E at threshold factor s.


2.3.2  Slope  Excitation 

The second major step of the ISO-532B loudness calculation examines loudness due to
slope excitation. The excitation slopes correspond to the subjectively measured slopes
of  masked  thresholds.  In  Zwicker’s  model  these  thresholds  were  developed  from  masking 
patterns  of  narrowband  noises.  In  contrast,  the  notch  noise  method  develops  excitation 
patterns  directly  from  auditory  filter  shapes  with  excitation  evaluated  on  an  ERB  scale. 
The  masked  thresholds  are  the  peripheral  excitation  levels  generated  by  the  main  excitation. 
The  main  excitations  are  used  in  conjunction  with  slope  tables  to  calculate  specific  loudness 
areas.  These  slopes  are  a function  of  excitation  level  and  frequency.  Masking  thresholds  are 
graphically  calculated  using  charted  masking  slopes  from  the  main  excitations.  Masking 
thresholds  greater  than  the  excitation  level  in  a critical  band  replace  the  corresponding 
excitation  level  in  that  critical  band. 

The first step in the ISO-532B generates a 1/3 octave band loudness vector in sones
using Eq(2.23). Recalling that the excitation slopes on a log magnitude Bark scale are approximately
linear, the slopes should be similar for a given level on a log magnitude scale. In
the slope table they are not. Since the first step of the ISO-532B generates a representation
of loudness, it has to precalculate the masking slopes to account for the compression. Thus,
the table slopes correspond to the experimental observation that masking slopes decrease
with increasing level when the data is on a log magnitude Bark scale. In their code [66], the
slopes are precalculated to include the compression of Eq(2.23). These slopes are used to
calculate  the  accessory  loudness  due  to  masking  and  include  that  contribution  in  the  final 
measure  of  loudness.  The  ISO-532B  is  an  iterative  graphical  procedure  which  progressively 
calculates  total  loudness  along  the  critical  band  scale.  It  calculates  total  loudness  as  the 
area  under  the  curve  of  the  generated  excitation  function.  This  is  the  final  step  and  is 
simply  the  summation  of  all  the  specific  loudness  values.  In  the  ISO-532B  the  sone  level  is 
also  converted  to  the  phon  level  using  Eq(2.28). 




2.3.3  Discussion 

Loudness is in units of sones (Ψ) and the constant N_0 = 0.064 of Eq(2.23) ensures that
a 1KHz sine wave tone at 40dB sets the scale to 1 sone. It should be noted that the total
loudness must include the accessory (masked) loudness calculated in the second step of the
ISO-532B. Thus, setting of the constant is the last step in deriving the loudness standard.
Recall that a 1KHz sine wave tone at 40dB is required to give a loudness of 1 sone. For
illustration, we can set the intensity power law of Eq(2.10) with a compression of k = 0.3
and unknown factor c. The parameter c sets the scale reference for a 1KHz tone at 40dB
to 1 sone.


= c/'"  (2.24) 

c = = 0.0631  (2.25) 

(2.26) 

The ISO-532B provides the loudness of the sound in sones. A loudness function
relating loudness in sones to loudness level in phons is shown below. By definition any
tone with a loudness level of 40 phon corresponds to a loudness of 1 sone.

$$\log_{10}(\Psi) = \log_{10}(c) + (k/10)L$$
$$\log_{10}(\Psi) = -1.2 + 0.03L \qquad (2.27)$$
$$\log_{2}\Psi = \ln\Psi / \ln 2$$
$$\log_{2}\Psi = \frac{1}{\log_{10}2}\,(-1.2 + 0.03L) \qquad (2.28)$$

and solving for $\Psi$ gives the loudness in sones as a function of the loudness level $L$ in
phons, where $L = 10\log_{10}(I/I_0)$. Eq(2.28) is used
in the ISO-532B to relate loudness in sones to loudness level in phons. It is identical to
that given in the ISO/R 131.
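
As a quick numerical check of Eq(2.24) through Eq(2.28), the following minimal sketch computes the scale constant c for k = 0.3 and converts between sones and phons. The function names are illustrative and not part of the ISO-532B code; the only inputs taken from the text are the exponent k = 0.3 and the 40dB reference.

```python
import math

K = 0.3  # compression exponent used for illustration in the text


def scale_constant(k=K, ref_level_db=40.0):
    """Constant c such that a tone at ref_level_db gives 1 sone: c * (I/I0)**k = 1."""
    intensity_ratio = 10.0 ** (ref_level_db / 10.0)  # I / I0 at the reference level
    return 1.0 / intensity_ratio ** k


def sones_from_phons(level_phon, k=K):
    """Loudness in sones from log10(psi) = log10(c) + (k/10) * L."""
    return 10.0 ** (math.log10(scale_constant(k)) + (k / 10.0) * level_phon)


def phons_from_sones(sones, k=K):
    """Inverse mapping, solving the same relation for L."""
    return (math.log10(sones) - math.log10(scale_constant(k))) * 10.0 / k


print(scale_constant(), sones_from_phons(40.0), sones_from_phons(50.0))
```

With k = 0.3 a 10 phon increase multiplies the loudness by 10^0.3, which is very nearly a doubling; this is why the relation is commonly quoted as a doubling of loudness per 10 phon.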

Also, in the ISO-532B analysis, the compression factor k is set to 0.25, instead of
k = 0.23, the optimal value found by Zwicker, to reduce the additional loudness contributed by




the  lower  frequency  1/3  octave  band  cut-off  slopes.  This  is  a consequence  of  using  1/3 
octave  filters  instead  of  critical  band  filters.  Also,  the  threshold  factor,  s = 0.25  provides 
the  most  sensitive  value  of  just  noticeable  amplitude  modulation.  This  is  the  value  used 
in  the  ISO-532B  analysis.  It  is  the  just  noticeable  change  in  intensity  variation,  and  is  a 
function  of  the  modulation  degree,  m. 


$$s = \ldots\,\frac{1+m}{1-m} \qquad (2.29)$$


s = 0.25 corresponds to the just noticeable degree of modulation (m = 6%) with a logarithmic
threshold factor of approximately 1dB [149]. This occurs at the highest center frequency where the
critical bands are much wider. Signal amplitude fluctuations in a larger band are more
noticeable than in a smaller band.
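
The correspondence between a 6% modulation degree, s = 0.25, and roughly 1dB can be checked numerically. The sketch below is only a plausibility check under two assumptions that are not taken from Eq(2.29) itself: the modulation is sinusoidal amplitude modulation, so the envelope swings between (1 + m) and (1 - m) times the carrier amplitude, and the logarithmic threshold factor is read as 10 log10(1 + s).

```python
import math


def am_level_swing_db(m):
    """Level difference in dB between envelope maximum and minimum for modulation degree m."""
    return 20.0 * math.log10((1.0 + m) / (1.0 - m))


def log_threshold_factor_db(s):
    """Threshold factor s expressed as a level increment, 10*log10(1 + s) (assumed reading)."""
    return 10.0 * math.log10(1.0 + s)


print(am_level_swing_db(0.06), log_threshold_factor_db(0.25))
```

Both quantities come out close to 1dB, which is consistent with the values quoted above.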


2.4  Simplifying  the  Loudness  Model 

We  have  seen  that  the  ISO-532B  is  a tedious  but  extremely  accurate  graphical  procedure 
that  precisely  calculates  the  loudness  of  complex  spectral  distributions.  It  is  complicated 
by  the  fact  that  it  relies  on  a 1/3  octave  band  analysis.  The  ISO-532B  is  primarily  meant 
for  relatively  steady  state  sounds,  and  does  not  directly  account  for  temporal  integration 
effects of loudness, i.e., the change in loudness due to the subjective duration of the sounds.
The  burdensome  calculations  of  the  ISO-532B  are  the  iterative  calculations  of  the  accessory 
loudness  due  to  slope  excitation.  It  implements  a forward  step  procedure  which  determines 
the  contribution  of  accessory  loudness  progressively  across  a critical  band.  Loudness  is 
calculated  from  left  to  right  on  the  critical  band  scale.  Essentially  the  procedure  has  to 
keep  track  of  the  local  slope,  excitation  level,  and  where  it  is  in  relation  to  where  the  critical 
band  ends. 

Our  interest  is  to  develop  a simplified  loudness  approximation  that  is  coherent  with 
a loudness analysis of the ISO-532B. A loudness approximation would be useful in a
speech  enhancement  or  recognition  system  which  makes  beneficial  use  of  the  loudness  level. 
The  Perceptual  Linear  Prediction  (PLP)  technique  is  a front  end  feature  extractor  for  LPC 
based  speech  recognition  applications.  It  generates  a feature  vector  in  a manner  similar 




to  how  the  auditory  system  attempts  to  encode  sound  information.  The  PLP  technique 
adheres  to  the  same  processing  stages  of  the  auditory  system.  The  PLP  method  has  been 
shown  to  be  consistent  with  the  sensitivity  of  human  hearing  to  changes  in  several  important 
speech  parameters.  The  psychoacoustic  principles  of  the  PLP  technique  make  it  a suitable 
candidate  for  use  as  a low-complexity  loudness  analysis  method.  We  briefly  review  the  PLP 
method in preparation for extending it to our loudness model.

2.4.1  PLP  Technique 

The  Perceptual  Linear  Prediction  (PLP)  technique  [56]  incorporates  three  concepts  of 
the  psychophysics  of  hearing:  1)  critical  band  analysis,  2)  equal- loudness  pre-emphasis, 
and,  3)  the  intensity-loudness  power  law.  The  PLP  method  is  as  follows: 

a) The 20ms speech frame s(n) at fs = 10KHz is weighted by a Hamming window of
length N,

$$W(n) = 0.54 + 0.46\cos[2\pi n/(N-1)] \qquad (2.30)$$

b)  The  short  time  power  spectrum  is  calculated 

$$P(\omega) = \mathrm{Re}[S(\omega)]^{2} + \mathrm{Im}[S(\omega)]^{2} \qquad (2.31)$$

c) The spectrum $P(\omega)$ is warped along its frequency axis to the Bark scale

$$\Omega(\omega) = 6\ln\left\{\omega/1200\pi + \left[(\omega/1200\pi)^{2} + 1\right]^{0.5}\right\} \qquad (2.32)$$

d) The warped power spectrum is convolved with a critical band masking function $\Psi(\Omega)$
to mimic auditory smearing,

$$\Theta(\Omega_i) = \sum_{\Omega=-1.3}^{2.5} P(\Omega - \Omega_i)\,\Psi(\Omega) \qquad (2.33)$$




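where

$$\Psi(\Omega) = \begin{cases} 0 & \Omega < -1.3 \\ 10^{2.5(\Omega+0.5)} & -1.3 \le \Omega \le -0.5 \\ 1 & -0.5 < \Omega < 0.5 \\ 10^{-1.0(\Omega-0.5)} & 0.5 \le \Omega \le 2.5 \\ 0 & \Omega > 2.5 \end{cases} \qquad (2.34)$$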


e) The critical band power spectrum $\Theta[\Omega(\omega)]$ is preemphasized by $E(\omega)$, the simulated
equal loudness curve at the 40dB level

$$\Xi[\Omega(\omega)] = E(\omega)\,\Theta[\Omega(\omega)] \qquad (2.35)$$


(2.36) 


f) A cubic root compression is applied as an approximation of the power law of hearing

$$\Phi(\Omega) = \Xi(\Omega)^{1/3} \qquad (2.37)$$

These are the primary steps in representing the signal spectrum as an auditory spectrum
before linear prediction. The next steps convert the auditory spectrum to an autoregressive
model. A low order IDFT is taken to generate the autocorrelation, and the normal equations
are solved in the usual way. For loudness we are not concerned with these last steps.

2.4.2 Extending PLP for Loudness

It should be noted that step d) does not provide a complete representation of loudness.
The masking function $\Psi(\Omega)$ is a general representation of the smearing properties of the
auditory filters. Also in step e), the equal loudness contours should vary slightly with level.
The use of a general smearing function is acceptable for most auditory modelling approaches,
and it does not affect the PLP for its intended purpose. However, we are concerned with
extending the PLP method for loudness analysis. The critical band masking slopes of $\Psi(\Omega)$
are not dependent on the magnitude or frequency of $P(\Omega)$. In particular, as seen in Figure
2.11 and in Eq(2.34), the PLP filter slopes are constant for different levels. The loudness of
sounds is a complex function of the main excitation and the masked threshold as we have
seen. The masking threshold is determined by the filter slopes which have been shown not to
be  constant  at  different  levels  as  was  seen  in  Figure  2.7.  A suggestion  would  be  to  replace 


Figure 2.11: 16 weighting functions used to compute $\Theta[\Omega(\omega)]$.


the  masking  function  with  a frequency  and  level  dependent  function  to  account  for  the 
accessory  loudness.  The  implementation  of  Moore’s  continuous  auditory  filters  presented 
in  Figure  2.3  would  precisely  satisfy  this  requirement  [109,  96].  However,  such  analytic 
functions  are  not  computationally  trivial.  This  is  the  reason  a general  smearing  function  is 
employed. 
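
To make the level independence of the general smearing function concrete, steps c), d), and f) of the PLP method can be sketched in a few lines. This is a minimal sketch under stated assumptions: the power spectrum is already sampled on a uniform Bark grid, the smearing of Eq(2.33) is implemented as a discrete convolution with the upper skirt falling at 10dB per Bark as in the usual PLP formulation, the equal loudness weighting of step e) is omitted, and all function names are illustrative.

```python
import math


def hz_to_bark(f_hz):
    """Bark warping of Eq(2.32); with w = 2*pi*f the argument w/(1200*pi) equals f/600."""
    x = f_hz / 600.0
    return 6.0 * math.log(x + math.sqrt(x * x + 1.0))


def masking_curve(omega):
    """Level-independent critical band curve of Eq(2.34); omega is an offset in Bark."""
    if omega < -1.3 or omega > 2.5:
        return 0.0
    if omega <= -0.5:
        return 10.0 ** (2.5 * (omega + 0.5))   # steep lower skirt, 25 dB per Bark
    if omega < 0.5:
        return 1.0                              # flat top, one Bark wide
    return 10.0 ** (-1.0 * (omega - 0.5))       # shallow upper skirt, 10 dB per Bark


def smear(bark_power, step=1.0):
    """Discrete convolution: theta[i] = sum_j P[j] * psi((i - j) * step)."""
    n = len(bark_power)
    return [sum(bark_power[j] * masking_curve((i - j) * step) for j in range(n))
            for i in range(n)]


def cube_root_compress(theta):
    """Intensity-loudness power law of step f), exponent 1/3."""
    return [t ** (1.0 / 3.0) for t in theta]


# A single component in band 5 of a 12-band grid (hypothetical input).
pattern = smear([0.0] * 5 + [1.0] + [0.0] * 6)
print([round(v, 3) for v in pattern])
```

Scaling the input by any factor scales the smeared output by the same factor, so the shape of the pattern never changes with level. That is exactly the property the loudness extension needs to replace.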

As  an  alternative,  a procedural  approach  can  be  used  to  generate  a masking  threshold 
using  the  upper  and  lower  excitation  slopes.  The  methodology  is  similar  to  the  ISO-532B 
but  without  the  complication  of  the  slope  table  calculations.  We  have  seen  in  Figure  2.7  that 
the  excitation  slopes  are  approximately  linear  on  a log  magnitude  Bark  scale.  Hauenstein 
has  suggested  the  use  of  magnitude  and  frequency  dependent  linear  slopes  for  excitatory 
modelling  as  a low  complexity  operation  for  loudness  analysis  [51].  The  lower  frequency 
slope is set constant on the Bark scale, $S_1 = 27$ dB/Bark, and the upper frequency slope is
described by

$$S_2 = \left[24 + \frac{0.766}{e^{z/6} - e^{-z/6}} - 0.2L\right] \frac{\mathrm{dB}}{\mathrm{Bark}} \qquad (2.38)$$

where z is the Bark index z = 0 to 24, and L is the measured SPL in dB. The frequency
dependent term contributes most at low frequencies, and the SPL is the governing factor
of the slope. The masked excitation can be determined from these slopes to provide a level
and frequency dependent masking threshold in place of the critical band masking function.




As  presented  earlier,  the  roex  filters  provided  a good  approximation  of  the  generated 
excitation functions with regard to experimental observations. It has been generally
assumed that the low frequency excitation slopes are constant with level. Moore and Glasberg
suggest  that  these  slopes  do  in  fact  change  slightly  with  level.  To  better  account  for  this 
observation  and  to  provide  a slightly  better  match  with  the  roex  auditory  model  excitations, 
we  have  included  a slight  modification  to  the  lower  frequency  slope  Eq(2.39). 

Figure  2.12  shows  our  linear  slope  approximations  in  comparison  to  the  excitation  functions 
generated by the level dependent auditory filters of Eq(2.7) for a 1KHz sinusoid.


Figure  2.12:  Linear  approximation  to  excitation  slopes  generated  by  roex  auditory  filters. 
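
The linear slope model is simple enough to sketch directly. The following illustrative code evaluates the constant lower slope and the upper slope of Eq(2.38), and builds the triangular excitation that a single component produces on a Bark grid. The function names, the 21-band grid, and the restriction to z > 0 (Eq(2.38) is singular at z = 0) are assumptions of this sketch, and the level dependent modification of the lower slope is not included.

```python
import math


def lower_slope_db_per_bark():
    """Constant lower-frequency slope S1 quoted with Eq(2.38)."""
    return 27.0


def upper_slope_db_per_bark(z, level_db):
    """Upper-frequency slope S2 of Eq(2.38); z is the Bark index, level_db the SPL in dB."""
    if z <= 0:
        raise ValueError("Eq(2.38) is singular at z = 0; use z > 0")
    freq_term = 0.766 / (math.exp(z / 6.0) - math.exp(-z / 6.0))
    return 24.0 + freq_term - 0.2 * level_db


def excitation_from_tone(z0, level_db, n_bands=21):
    """Triangular excitation in dB on the Bark grid produced by one component in band z0."""
    s1 = lower_slope_db_per_bark()
    s2 = upper_slope_db_per_bark(z0, level_db)
    return [level_db - (z0 - k) * s1 if k < z0 else level_db - (k - z0) * s2
            for k in range(n_bands)]


for level in (40.0, 60.0, 80.0):
    print(level, round(upper_slope_db_per_bark(8, level), 2))
```

The upper slope becomes shallower as the level rises, so a louder component spreads its excitation further toward higher critical bands, while the lower skirt stays fixed.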


2.4.3  The  Loudness  Approximation 

We  have  examined  the  primary  functions  of  auditory  analysis  in  the  human  hearing 
system.  Now  we  proceed  with  the  steps  for  our  loudness  approximation, 
a)  Critical  band  warping  of  the  power  spectrum:  We  used  the  Oppenheim  recursion  [110, 
130]  on  the  autocorrelation  sequence  for  the  frequency  warping  transformation  using 




Eq(2.40). $\alpha = 0.56$ approximates critical band warping for fs = 16KHz.


For 0 < n < N { ... For 2 < k < N { ... } }    (2.40)

The warped power spectrum is obtained by the FFT of the warped autocorrelation
sequence, $\tilde{r}_k$.


We warp the response to a Bark scale since Eq(2.38) operates on a critical band scale.
Figure  2.13  shows  an  example  of  the  frequency  warping.  The  recursion  is  used  as  a 
warping  technique  in  the  design  of  WLPC  filters. 


Figure 2.13: Frequency warping using Oppenheim recursion on autocorrelation sequence (x-axis in Barks, 0 to 20).

b) Next we proceed with critical band summation. There are 20.7 Bark intervals z with
fs=16KHz using Eq(2.32). For simplicity, using 21 Bark intervals, an unconventional
630 point FFT in step a) allows 15 integral samples per Bark in the summation of $P(\Omega)$,
the warped power spectrum of step a) in dB,

$$P(\Omega) = 10\log_{10}\left(2\,\mathrm{Real}[R(k)]\right)$$

The FFT size is not a power of 2 and is not optimal in computational efficiency. As an
alternative, a polyphase filterbank can be employed for the critical band filtering [51].

$$\Theta(\Omega_z) = \sum_{\Omega=z}^{z+1} P(\Omega), \qquad z = 0 \ldots 20$$

c) Include the outer to middle ear sensitivity specified by $H_{lp}(z)$ in Eq(2.20). For
convenience, we use the discrete critical band frequency response $A_0$ at each critical band
center frequency as seen in Figure 2.14.

Figure 2.14: Outer to middle sensitivity characteristics (y-axis in dB, x-axis in Bark).

The sensitivity is added in the warped spectrum. In the ISO-532B the sensitivity is
represented as an attenuation and is subtracted, since it is the inverted outer to
middle ear response.

$$\Theta(\Omega_z) = \Theta(\Omega_z) + A_0$$


d) Calculate the level and frequency dependent excitation slopes for each critical band

$$S_2(z) = 24 + \frac{0.766}{e^{z/6} - e^{-z/6}} - 0.2\,\Theta(\Omega_z)$$
$$S_1(z) = 15 + \Theta(\Omega_z)/10$$




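e) Construct an interim excitation vector $L_E(\Omega)$ for each critical band z in the matrix
$L_E(\Omega, z)$. Here $\Omega$ is the discrete sequence of critical bands

$$L_E(\Omega_k, z) = \begin{cases} \Theta(\Omega_z) + (k - z)\cdot S_1(z) & k < z \\ \Theta(\Omega_z) & k = z \\ \Theta(\Omega_z) + (z - k)\cdot S_2(z) & k > z \end{cases}$$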


f) Select maximum excitations across all critical band elements of the excitation matrix
$L_E(\Omega, z)$

$$L_E(\Omega) = \max_z \{L_E(\Omega, z)\}$$

$$E(\Omega_z) = 10^{L_E(\Omega_z)/10}$$

Figure 2.15, as another example, shows the interim excitations generated by an arbitrary
spectrum. The combined excitation is the envelope (maximum) of the interim excitations.


Figure 2.15: Determination of maximum interim excitations.


g) Include the power law of hearing using Moore and Glasberg's model of loudness. The
3-phon equal loudness contour described by Eq(2.22) is used for the threshold of hearing
$L_{ETQ}$. The elements are the equal loudness threshold attenuations in dB at the critical
band center frequencies as shown in Figure 2.16.

$$E_{TQ} = 10^{L_{ETQ}/10}$$

$$N'(\Omega_z) = c\left[E(\Omega_z)^{0.23} - E_{TQ}(\Omega_z)^{0.23}\right] \qquad (2.41)$$




Figure 2.16: Absolute threshold of hearing (y-axis in dB, x-axis in Bark).

h) Calculate total loudness from summation of critical band specific loudness values.

$$N = \sum_{z=0}^{20} N'(\Omega_z) \qquad (2.42)$$
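
Step a) relies on the Oppenheim recursion for frequency warping. The sketch below uses the standard form of that recursion for a first-order all-pass substitution, as it is commonly given for warping a one-sided sequence; whether it matches Eq(2.40) term for term is an assumption, as are the function names. For an autocorrelation sequence, the zero lag is halved before warping and doubled afterwards so that the real part of the warped spectrum is preserved, which is also an assumption of this sketch.

```python
def warp_sequence(c, alpha, n_out):
    """Frequency-warp the one-sided sequence c[0..M] into n_out coefficients.

    Each z**-1 in sum(c[n] * z**-n) is replaced by the all-pass
    (z**-1 + alpha) / (1 + alpha * z**-1); the result is truncated to n_out terms.
    """
    b = 1.0 - alpha * alpha
    g = [0.0] * n_out
    for n in range(len(c) - 1, -1, -1):   # feed the input from the last sample to the first
        d = g[:]                           # state left by the previous pass
        g[0] = c[n] + alpha * d[0]
        if n_out > 1:
            g[1] = b * d[0] + alpha * d[1]
        for k in range(2, n_out):
            g[k] = d[k - 1] + alpha * (d[k] - g[k - 1])
    return g


def warp_autocorrelation(r, alpha, n_out):
    """Warp a one-sided autocorrelation r[0..M], treating the zero lag as r[0] / 2."""
    c = [0.5 * r[0]] + list(r[1:])
    g = warp_sequence(c, alpha, n_out)
    g[0] *= 2.0
    return g


# A unit delay maps onto the impulse response of the all-pass section.
print(warp_sequence([0.0, 1.0], 0.5, 4))
```

With alpha = 0 the recursion returns its input unchanged, and a white (impulse) autocorrelation stays white for any alpha. The warped power spectrum is then obtained from the warped sequence by a Fourier transform, as described in step a).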


2.4.4  Model  Discussion 

The outer to middle ear transmission is intimately related to the equal loudness contours
as we have seen, and there is still debate on its true form. For consistency with
Zwicker's model of $A_0$ we exclude the low frequency attenuations below 2KHz introduced
by $H_{hp}(z)$, in comparison to including the internal noise floor described by Moore. For our
implementation, we simply use the equal loudness contour described by $H_{lp}(z)$ in Eq(2.20).
The attenuation $A_0$ was sampled on a critical band rate from this response. This step can
be seen as replacement of the 40dB equal loudness preemphasis step in the PLP method.
The  preemphasis  is  a result  of  our  unequal  sensitivity  to  frequency,  and  has  been  better 
modelled  with  the  level  dependent  excitation  slopes. 

The  excitation  slope  calculations  replace  the  general  smearing  function  of  the  PLP 
method.  The  slopes  are  a function  of  level  and  frequency.  Since  the  slopes  are  linear 
on  a log  scale,  the  smeared  power  of  each  interim  excitation  can  be  calculated  by  slope 
subtraction  on  the  dB  scale,  or  by  a repeated  multiplication  on  the  magnitude  scale  [51]. 
This is computationally convenient. The combined excitation at each specific point $\Omega$ can




be defined as the maximum of all interim excitations that stimulate that point. It can be
interpreted  as  the  total  neural  activity  evoked  by  the  stimulus  [11].  The  combined  excitation 
is  the  envelope  of  the  interim  excitations,  and  is  used  in  the  final  loudness  calculation.  The 
total  loudness  is  the  sum  of  the  specific  loudnesses  which  collectively  represent  the  total 
neural  output  generated  by  the  stimulus. 
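
The slope subtraction, envelope selection, and loudness summation described above correspond to steps d) through h) of Section 2.4.3 and can be sketched as follows. Several choices here are assumptions of the sketch rather than statements from the text: band z is evaluated at its centre z + 0.5 to avoid the singularity of the upper slope at z = 0, dB levels are converted to power with 10^(L/10), specific loudness takes the form c(E^k - E_TQ^k) floored at zero, and the input levels are taken to include the outer to middle ear weighting of step c) already. The defaults c = 0.08 and k = 0.21 are the values quoted for Eq(2.41); the function names are illustrative.

```python
import math


def slopes(z, theta_db):
    """Level- and frequency-dependent slopes of step d), in dB per Bark."""
    s1 = 15.0 + theta_db / 10.0
    s2 = 24.0 + 0.766 / (math.exp(z / 6.0) - math.exp(-z / 6.0)) - 0.2 * theta_db
    return s1, s2


def combined_excitation(theta_db):
    """Steps e) and f): envelope (maximum) of the interim excitations, in dB."""
    n = len(theta_db)
    env = [-math.inf] * n
    for z, level in enumerate(theta_db):
        s1, s2 = slopes(z + 0.5, level)    # band centre, an assumption to avoid z = 0
        for k in range(n):
            if k < z:
                interim = level - (z - k) * s1   # lower skirt of band z
            else:
                interim = level - (k - z) * s2   # peak (k == z) and upper skirt
            env[k] = max(env[k], interim)
    return env


def total_loudness(theta_db, threshold_db, c=0.08, k=0.21):
    """Steps g) and h): specific loudness per band, then the sum over bands."""
    total = 0.0
    for e_db, tq_db in zip(combined_excitation(theta_db), threshold_db):
        e = 10.0 ** (e_db / 10.0)
        e_tq = 10.0 ** (tq_db / 10.0)
        total += max(0.0, c * (e ** k - e_tq ** k))
    return total


bands = [0.0] * 21
bands[8] = 60.0                             # hypothetical single component in band 8
print(round(total_loudness(bands, [0.0] * 21), 3))
```

Because every operation is a subtraction, a comparison, or a table-free power law, the whole calculation costs on the order of the number of bands squared, which is the computational convenience noted above.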


Figure 2.17: Loudness predictions of the ISO-532B (dotted) and the warped loudness
approximation (solid).

Figure  2.17  shows  the  inverted  equal  loudness  curves  predicted  by  the  described  loudness 
model  and  the  ISO-532B  loudness  model  for  comparison.  Each  curve  represents  a frequency 
sweep of a sinusoid from 100Hz to 5KHz in 50Hz increments at SPL levels of 40, 50, 60, 70,
80, and 90dB. Loudness approximation errors are presented in Chapter 3 for test sentences
of the TIMIT database. In the ISO-532B analysis, a 1 second sinusoid at each specified
frequency (fs=16KHz) is 1/3 octave band filtered into 28 bands containing the computed
RMS powers in dB at the corresponding preferred labelling frequencies (standard ANSI
S1.6-1984). The Matlab Octave toolbox was used for this procedure. The 1/3 octave band
powers were passed to the ISO-532B for loudness analysis. We transcribed the ISO-532B
code in [66] into Matlab equivalent code, and have provided it in Appendix D. The ISO-532B
provides  the  phon  and  sone  level  output. 

The  loudness  contours  generated  by  the  procedures  of  section  2.4.3  are  shown  as  solid 
lines  for  comparison.  We  use  the  parameter  values  experimentally  provided  by  Moore  and 
Glasberg  [95]  in  Eq(2.41):  k = 0.21  and  c = 0.08.  They  seem  to  provide  a close  fit  to 
Zwicker’s  model,  seen  as  the  ISO-532B  contours  in  Figure  2.17.  It  can  be  seen  that  the 




responses  are  not  identical,  but  are  reasonable  approximations  at  moderate  levels,  60  to 
80dB.  At  low  SPL  levels  the  relative  difference  is  more  pronounced,  occasionally  greater 
than a 1 sone difference. The discontinuities in the contours are the result of discretizing the
excitation into a critical band spectrum. As a final example, Figure 2.18 shows the
loudness analysis of a TIMIT sentence using the ISO-532B and the loudness approximation.
This  shows  that  the  approximation  can  provide  a reasonably  close  fit  to  the  ISO-532B  for 
general  loudness  analysis. 


Figure 2.18: Loudness prediction of ISO-532B (dotted) and approximation (solid) for the
TIMIT sentence "Don't ask me to carry an oily rag like that."

In this chapter we have presented a loudness approximation to the ISO-532B which is
consistent in principle with the loudness analysis of the auditory system. The procedures
differ from the ISO-532B in the following ways:

1.  The  analysis  operates  directly  on  a critical  band  spectrum,  in  comparison  to  the 
ISO-532B  which  operates  on  the  one-third  octave  band  spectrum. 

2.  The  excitation  slopes  of  the  auditory  filter  responses  are  approximated  by  a linear 
level-dependent  function.  In  contrast,  the  ISO-532B  evaluates  excitation  level  from 
tabulated  slope  data  of  masked  thresholds. 

3.  Total  loudness  is  calculated  using  Moore  and  Glasberg’s  model  to  account  for  the 
loudness  growth  near  threshold. 

The  loudness  procedure  we  have  described  provides  a means  of  calculating  loudness  in  a 
manner  similar  to  the  processing  stages  of  the  auditory  system.  It  is  a methodical  procedure, 
inspired  by  the  PLP  method,  which  attempts  to  capture  an  approximation  of  loudness 
coherent with the ISO-532B standard. For practical purposes it seems to characterize well




the predictions of the ISO-532B standard. The simplification removes some of the complexities
of the ISO-532B that, at the time, were practical and effective. The graphical procedures
and charting tables were replaced with analytic equations. Like the ISO-532B, however, our method
also operates on a discrete basis, and does not provide an exact representation of loudness.
The continuous models described by Moore and Glasberg necessitate the implementation
of auditory filters, which are slightly more involved. This would alleviate the
discontinuities and allow a continuous representation of the loudness pattern. In closing,
our  model  is  a computationally  convenient  way  of  approximating  loudness  patterns  similar 
to  predictions  of  the  ISO-532B. 


CHAPTER  3 
VOWEL  POWER 


Figure 3.1 shows the classic results of Peterson and Barney's 1952 vowel experiment. It
graphically shows the first two formant frequency distributions of 10 American English vowels
spoken by 76 different speakers. The figure represents a vowel space and demonstrates
the variability of pronunciation style and formant frequency. The vowel space diagram
illustrates where a vowel sound is located both in the acoustic and articulatory space. The
acoustic space describes the formant frequency, and the articulation space describes the
vowel articulatory configuration.


Figure  3.1:  Average  formant  locations  for  vowels  in  American  English  (Peterson  and  Barney, 
1952). 




3.1  Vowels 

Vowels  are  associated  with  a steady-state  articulatory  configuration  and  are  typically 
characterized  by  the  first  three  formants.  Vowels  are  classified  according  to  the  position 
of  the  tongue,  shape  of  the  lips,  and  duration  [60].  The  configuration  is  described  by  the 
tongue  position  (back,  middle,  front)  and  height  (high,  middle,  low).  The  front  vowels  are: 
/i,I,ae,e/; mid vowels: /a,A,D/; and back vowels: /U,u,0/. Duration is described as long
or  short.  Long  is  synonymously  described  as  being  tense,  and  short  as  being  lax.  Long 
vowels  tend  to  have  proportionally  short  vowel  to  consonant  transitions  known  as  short 
off  glides,  and  short  vowels  tend  to  have  long  off  glides.  As  can  be  seen  from  Figure  3.1, 
the  front  vowels  have  relatively  high  second  and  third  formants,  the  middle  vowels  have 
well-separated  and  balanced  formants,  and  the  back  vowels  have  most  of  the  energy  in  the 
low  frequency  region.  There  is  noticeable  overlap  as  expected,  and  when  there  is  confusion 
it  is  usually  with  an  adjacent  vowel.  The  primary  acoustic  cues  in  vowel  perception  are 
formant frequency location, bandwidth, amplitude, and duration. Figure 3.2 shows the
average  formant  locations  and  bandwidths  for  the  ten  vowels  of  Figure  3.1  [21]. 

The widely accepted formant hypothesis was formed by the classic study of Peterson and
Barney [60]. The formant hypothesis states that speech formants provide the primary cue to
vowel perception. Additionally, the first two to three formants provide vowel discrimination,
and the second and third formant generally discern the intelligibility of the vowel [67].
Studies also suggest that spectral shape is a secondary measure of vowel perception when
formant peaks are not sufficiently prominent [60]. Studies suggest that alteration of formant
frequency  location  can  affect  phonetic  quality,  whereas  bandwidth  or  spectral  tilt  would  not 
affect  phonetic  quality  [2].  Bandwidth  adjustment  or  spectral  tilt  is  noticed  as  a change 
in  the  speaker  characteristics.  Spectral  variations  which  affect  the  peak  locations  severely 
affect  the  phonetic  interpretation  of  the  vowel  spectrum  [67].  Hearing  also  seems  to  be 
about  3 times  less  sensitive  to  bandwidth  changes  than  to  changes  in  formant  frequency 
[56]. 




Figure 3.2: Average formant locations and bandwidths for vowels in American English with
corresponding dB drop of formant amplitude from 60dB reference [21] (y-axis: frequency in Hz,
500 to 3500; x-axis: the ten vowels /i/, /I/, /E/, /æ/, /a/, /ɔ/, /U/, /u/, /ʌ/, /R/).

3.1.1  Synthetic  Model 


We  designed  a three  pole  formant  synthesizer  as  a cascade  of  2nd  order  resonant  filters  to 
model  our  ten  vowels.  This  is  the  linear  model  of  speech  production  for  formant  production 
proposed  by  Fant  and  discussed  in  Chapter  4 [30].  With  this  design  configuration  we  are 
able  to  set  the  formant  frequencies,  bandwidths,  and  peak  amplitudes  of  the  general  vowel 
parameters given in Figure 3.2. It also allows us to conveniently expand the bandwidths
directly  from  the  synthesis  equations.  We  are  also  able  to  exclude  pitch  dependencies  and 
glottal  filter  effects  since  we  directly  operate  on  the  spectral  envelope.  For  illustration,  this 
filter  model  provides  a direct  means  of  manipulating  the  general  vowel  characteristics  and 
evaluating  loudness  as  a function  of  formant  bandwidth. 

Consider a second-order system with a pole at $z = re^{j\theta}$, one at the conjugate location,
and two zeros at 0. This system can model the response of a single formant.

$$H_i(z) = \frac{1}{(1 - re^{j\theta}z^{-1})(1 - re^{-j\theta}z^{-1})}$$
$$H_i(z) = \frac{1}{1 - 2r\cos\theta\, z^{-1} + r^{2}z^{-2}}$$




The gain of the system is given by $G_i = H_i(z)|_{\theta=0}$. If the poles are well separated, the
bandwidth $B_i$ of a complex pole $z_i$ can be approximated as a function of radius by,

$$B_i = -\ln(r)\,f_s/\pi \qquad (3.1)$$


The  three  formant  synthesizer  is  the  combination  of  the  three  second  order  sections 


In actuality it was necessary to include two additional poles at the second and third formant
locations to precisely match the peak formant amplitudes. These additional poles are
extremely broad and strictly serve to elevate the formants to the peaks given in Figure
3.2. The three formant pole bandwidths are not altered by this procedure. Figure 3.3 shows
the synthesized vowel envelopes whose characteristics satisfy the vowel attributes of Figure
3.2.
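
The second-order sections and the bandwidth relation of Eq(3.1) can be sketched as follows. This is a minimal illustration of the three formant cascade only; the two additional broad poles are not modelled. The 8KHz sampling rate, the bandwidth values, and the function names are assumptions of the sketch, and the formant frequencies are the familiar Peterson and Barney averages for /i/.

```python
import cmath
import math


def resonator_coeffs(formant_hz, bandwidth_hz, fs_hz):
    """Denominator [1, a1, a2] of one second-order section; r follows from Eq(3.1)."""
    r = math.exp(-math.pi * bandwidth_hz / fs_hz)      # invert B = -ln(r) * fs / pi
    theta = 2.0 * math.pi * formant_hz / fs_hz         # pole angle in radians
    return [1.0, -2.0 * r * math.cos(theta), r * r]


def cascade_response_db(sections, freq_hz, fs_hz):
    """Magnitude in dB of the cascaded all-pole sections at one frequency."""
    zinv = cmath.exp(-2j * math.pi * freq_hz / fs_hz)
    h = 1.0 + 0j
    for a0, a1, a2 in sections:
        h /= a0 + a1 * zinv + a2 * zinv * zinv
    return 20.0 * math.log10(abs(h))


FS = 8000.0  # assumed sampling rate, giving a 0-4KHz envelope
# Hypothetical /i/-like vowel: (formant, bandwidth) pairs in Hz; bandwidths are illustrative.
vowel = [resonator_coeffs(f, bw, FS)
         for f, bw in [(270.0, 60.0), (2290.0, 90.0), (3010.0, 150.0)]]

for f in (270.0, 1500.0, 2290.0, 3010.0):
    print(f, round(cascade_response_db(vowel, f, FS), 1))
```

Expanding a formant bandwidth amounts to reducing the pole radius r while leaving the pole angle unchanged, which is why the bandwidths can be adjusted directly in the synthesis equations.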


Figure 3.3: Five pole formant synthesis of 10 American English vowel spectra (y-axis in
dB, x-axis is 0-4KHz).


The  ISO-532B  is  used  to  evaluate  the  accessory  loudness  due  to  masking  in  the  synthetic 
vowels. The ISO-532B analysis describes the total perceived loudness of a sound and the




amount of accessory loudness due to masking. Figure 3.4 shows the loudness patterns
and masking patterns of the 10 synthetic vowels described by Figure 3.2 and generated by
the extended five pole formant synthesizer of Eq(3.2). The shaded regions correspond to
the regions of masked loudness. In the corner of each subplot, the upper number is the
total vowel loudness in sones, and the lower number is the accessory (masked) loudness in
sones. The standard vowel attributes described by formant frequency, bandwidth, and
amplitude, seen in Figure 3.4, show minor effects of masking. The contribution of loudness
due to masking for these 10 synthetic vowels is around 5% of the total loudness on average.


Figure 3.4: ISO-532B vowel loudness patterns with accessory loudness due to masking in
shaded regions (x-axis in critical bands).


3.1.2  Masking  Effects 

Vowels  are  described  by  their  peak  location  and  relative  amplitudes.  Vowel  identification 
appears  to  be  closely  related  to  the  formant  frequencies  of  the  Vowel  Masking  Patterns 
(VMP), more so than to their amplitudes or to inter-peak characteristics [132]. Vowel
masking  patterns  have  been  used  to  describe  the  excitation  activity  evoked  by  a vowel,  and 




how  the  formant  structure  of  a vowel  is  represented  internally  [62].  Masking  effects  in  voiced 
regions,  such  as  the  vowels,  have  also  been  exploited  for  speech  coding  [123].  In  chapter 
4.3.1, we will see that the human auditory system is more sensitive to disruptions in the
valley  regions  of  voiced  speech  which  allows  for  lower  bit  rate  coding  schemes.  The  premise 
of  VMP  experiments  is  to  understand  abnormalities  of  cochlear  origin.  Electrophysiological 
evidence  demonstrates  the  peak  to  trough  formant  ratios  are  clearly  encoded  in  the  firing 
of  the  normal  cochlear  nerve  fibers  [145].  Vowel  masking  patterns  describe  the  internal 
auditory  representation  of  the  vowel  spectra  of  hearing  impaired  listeners  in  comparison  to 
normal hearing listeners. The masking patterns trace out the auditory system's frequency
sensitivity  and  selectivity.  The  masking  patterns  can  be  interpreted  as  suppression,  meaning 
if  something  is  masked,  there  is  a physiological  mechanism  which  suppresses  its  audibility.  It 
has  been  suggested  that  individuals  with  hearing  impairments  lack  this  internal  suppression 
[97].  Masking  patterns  of  hearing  impaired  individuals  tend  to  be  shallow,  and  the  reduction 
in contrast is generally attributed to a failure of this suppression [80].

Moore  and  Glasberg  have  performed  listening  test  experiments  to  determine  the  effects 
of  simultaneous  and  forward  masking  in  synthetic  vowels  [97].  Masking  patterns  have  been 
compared  for  both  types  of  masking  at  various  vowel  levels.  The  vowel  is  used  as  the 
masker, and the generated masking patterns are evaluated 1) to examine how well the
formant structure of vowels is represented internally, and 2) to determine the effects of
suppression in spectral contrast enhancement. The first point asks how well the
auditory system represents the signal spectrum, and the second examines the temporal
effects of suppression on masking. One concluding point of their investigation reveals that
the  internal  representation  of  steady-state  vowels  for  both  types  of  masking  effects  changes 
relatively  little  over  a 40-dB  range  level.  Even  at  high  levels  of  90dB  there  is  only  a slight 
adjustment.  If  the  masking  patterns  do  not  vary  much  with  level  in  vowels,  we  should  not 
expect  the  accessory  loudness  due  to  masking  to  change  much  relatively  with  level. 

In  Chapter  2 we  described  the  masking  phenomena  of  the  auditory  system  and  its  role  in 
loudness. We also noted that Zwicker's masking curves were based on the masking of "pure
tones by narrow band (critical band) noise." His model assumed loudness was independent of
bandwidth  for  bandwidths  less  than  a critical  band,  and  he  assumed  that  excitation  patterns 




within  a critical  band  were  the  same.  We  also  showed  how  Moore  and  Glasberg’s  model  of 
the  auditory  filters  accounted  for  this  observation,  and  the  reason  loudness  increases  when 
a critical band is exceeded. Zwicker also provided numerous results on "pure tones masked
by pure tone" experiments. In these masking experiments two observations emerged:

1) the low frequency masking slope becomes less steep with decreasing masker level, and

2) the high frequency masking slope becomes shallower with increasing level of the masker.
This  behavior  reveals  that  the  peak  masked  threshold  is  lower  for  pure  tone  maskers  in 
comparison  to  narrow  band  maskers  [149].  In  essence,  the  tonality  of  the  masker  establishes 
an  offset  to  the  masking  threshold.  Figure  3.5  illustrates  the  threshold  offset  for  both  a 
narrowband  masker  and  a pure  tone  masker  at  equal  power  levels. 


Figure 3.5: a) Tone and b) narrowband (critical band) masking thresholds, plotted as level in dB versus frequency.


The concept of a Spectral Flatness Measure (SFM) is useful for describing the tonality of
speech and the masking offset. The SFM is one measure used to calculate the Auditory Masking
Threshold (AMT), which is the main excitation displaced in amplitude as a function of
tonality [70]. The AMT is a spectral offset of the main excitation and is used to evaluate
the total masked power in speech. The AMT has been used for perceptual noise criteria
in  audio  coding  [120,  137],  psychoacoustic  compression  schemes  such  as  MP3  [3],  and  in 
perceptual  vocoder  designs  [69].  Certain  noise  suppression  routines  incorporate  the  AMT  in 
their  spectral  subtraction  processes  [135].  The  inaudible/audible  decisions  generated  by  the 
AMT  are  used  to  carefully  suppress  noise  without  decreasing  intelligibility  or  introducing 
speech  distortion  [140]. 




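The SFM in Eq(3.3) describes the statistical characteristics of the power spectrum. It
is based on the ratio of the geometric mean to the arithmetic mean of the power spectrum
P(k) over N frequency bins.

\[ \mathrm{SFM} = 1 - \frac{\left( \prod_{k=1}^{N} P(k) \right)^{1/N}}{\frac{1}{N} \sum_{k=1}^{N} P(k)} \tag{3.3} \]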

For  pure  tone  signals  the  SFM  approaches  unity  and  for  white  noise  signals  the  SFM 
approaches  zero.  The  threshold  is  offset  as  a function  of  the  SFM  value  and  critical  band 
number  z. 

\[ O(z) = \mathrm{SFM} \cdot (14.5 + z) + 5.5 \cdot (1 - \mathrm{SFM}) \tag{3.4} \]

The  AMT  is  then  described  by  [136] 

\[ \mathrm{AMT}(z) = 10^{\log_{10} E(z) - O(z)/10} \tag{3.5} \]

where E(z) describes the main excitation. The warped loudness approximation in Chapter
1 can be used to determine the excitation. The procedure for calculating the AMT is very
similar to that of calculating loudness. To calculate the AMT we perform steps a) through f)
of the loudness approximation in section 2.4.3. This provides the main excitation function
E(z) for the masking threshold. Calculation of the offset corresponds to steps 4) through
6) of the AMT algorithm in Appendix D, as given above. A tonal signal will have a larger
offset than a non-tonal signal, as determined by the SFM value. Figure 3.6 shows the AMT
for a vowel frame of speech. The AMT describes which frequency components fall below
the audible threshold. In this figure we see that the masked power is about 4.6% of the total
power for a particular vowel region of speech.
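
As an illustration of Eq(3.3) through Eq(3.5), the following Python sketch computes the
tonality measure, the offset, and the threshold for a short list of band excitations. It is
a minimal sketch under stated assumptions, not the implementation used in this work: the
function names, the small floor that guards the logarithm, the numbering of critical bands
from z = 1, and the sample values are all hypothetical.

```python
import math


def spectral_flatness_tonality(power):
    """Eq(3.3) as read here: 1 - (geometric mean / arithmetic mean).

    Returns a value near 1 for tone-like spectra and near 0 for flat spectra.
    """
    n = len(power)
    floor = 1e-12  # assumed guard against log(0); not part of the text
    log_geometric_mean = sum(math.log(max(p, floor)) for p in power) / n
    arithmetic_mean = sum(power) / n
    return 1.0 - math.exp(log_geometric_mean) / arithmetic_mean


def masking_offset_db(sfm, z):
    """Eq(3.4): threshold offset in dB for critical band number z."""
    return sfm * (14.5 + z) + 5.5 * (1.0 - sfm)


def auditory_masking_threshold(excitation, sfm):
    """Eq(3.5): AMT(z) = 10 ** (log10 E(z) - O(z) / 10), one value per band.

    `excitation` holds positive linear power values; bands are numbered from 1
    (an assumption made for this sketch).
    """
    return [
        10.0 ** (math.log10(e) - masking_offset_db(sfm, z) / 10.0)
        for z, e in enumerate(excitation, start=1)
    ]


# Hypothetical usage: a peaky (tone-like) spectrum versus a flat one.
tone_like = [1.0] + [1e-9] * 15
flat = [2.0] * 8
print(spectral_flatness_tonality(tone_like), spectral_flatness_tonality(flat))
print(auditory_masking_threshold([1.0, 1.0, 1.0], spectral_flatness_tonality(flat)))
```

With this reading of the equations, a more tonal frame receives a larger offset, so its
threshold sits further below the main excitation than that of a noise-like frame.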

The psychoacoustic model of the MPEG-1 Layer III (MP3) encoding standard incorporates
the AMT for compression [10]. The process evaluates tonal and non-tonal frequency
components and eliminates neighboring components which fall below the masking threshold
[105]. Only those components above the threshold are considered for coding. The AMT
provides an assessment of the inaudible frequency components for music compression. In
the following sections we will evaluate the masked power in vowels, and the contribution of
accessory loudness.

Figure 3.6: Auditory Masking Threshold (horizontal axis: frequency).


3.1.3  TIMIT 

The DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus (TIMIT) was designed
to provide speech data for the acquisition of acoustic-phonetic knowledge and for
the development and evaluation of automatic speech recognition systems. TIMIT contains
a total of 6300 sentences, 10 sentences spoken by each of 630 speakers from 8 major dialect
regions of the United States: New England, Northern, North Midland, South Midland,
Southern, New York City, Western, and Army Brat (moved around). A speaker’s dialect region is
the geographical area of the U.S. where they lived during their childhood. The text material
in TIMIT consists of 2 dialect sentences designed at SRI, 450 phonetically-compact
sentences designed at MIT, and 1890 phonetically-diverse sentences selected at TI. The dialect
sentences were meant to expose the dialectal variants of the speakers and were read by all
630 speakers. The phonetically-compact sentences were designed to provide good coverage
of pairs of phones. TIMIT provides word transcriptions of the sentences and time-aligned
phonetic transcriptions of the sentence material. The transcriptions are text labelled in an
accessory file and provide the sample point ranges of the hand-segmented labels. The phon
file, for instance, contains a table of all the phonemic and phonetic symbols used in the TIMIT
lexicon and in the phonetic transcriptions.




Table 3.1: TIMIT TEST phoneme occurrences (N), power (P%), accessory loudness (aL%),
masked power (mP%), and sone loudness approximation error (E%)

Stops          N       P   aL  mP     E  |  Affricates     N       P   aL  mP     E
b           1852    0.13  5.0   9  0.67  |  jh           542    0.45  6.6  43  0.38
d           1930    0.13  5.1  22  0.61  |  ch           527    1.30  6.5  50  0.38
g           1705    0.11  5.3   8  0.59  |  mean                      6.6  46  0.38
p           1933    0.15  4.4  16  0.56  |  total       1069  0.276*
t           2257    0.35  5.2  39  0.39  |
k           2130    0.29  5.0  22  0.52  |  Fricatives     N       P   aL  mP     E
dx          1793    0.27  5.8   6   0.7  |  s           2928    2.88  6.7  38  0.36
q           2035    0.86  6.2   4  0.57  |  sh          1863    2.36  6.5  45  0.37
bcl         1816    0.03  9.3  10  0.41  |  z           2076    1.02  7.1  32  0.31
dcl         2284    0.03  8.9  12  0.37  |  zh          1642    0.78  7.1  41  0.35
gcl         1696    0.03  8.8   8  0.36  |  f           1906    0.12  4.3  53  0.28
pcl         1950    0.01  7.9  17  0.61  |  th          1659    0.08  4.9  39  0.31
tcl         2718    0.02  8.4  19  0.54  |  v           1789    0.19  5.4  11  0.49
kcl         2433    0.01  7.3  16  0.52  |  dh          1882    0.12  5.4  13  0.47
mean                      6.6  16  0.53  |  mean                      6.0  33  0.37
total      28532  1.245*                 |  total      15745   3.84*

Vowels         N       P   aL  mP     E  |  Nasals         N       P   aL  mP     E
iy          3049    4.72  7.3   6  0.64  |  m           2134    0.36  7.0   6  0.44
ih          2342    6.41  6.9   5  0.77  |  n           2809    0.37  7.2   6  0.41
eh          2147   12.58  6.2   4  1.02  |  ng          1673    0.28  7.6   6  0.37
ey          1853   10.51  6.2   4  0.92  |  em          1628    0.20  6.2   5  0.43
ae          2214   22.20  5.8   4  1.07  |  en          1641    0.20  7.1   6  0.38
aa          2023   24.42  5.7   3  1.12  |  eng         1627    0.20  6.7   7  0.45
aw          1690   19.81  5.3   4  1.12  |  nx          1674    0.37  5.7   5  0.85
ay          1831   18.78  5.0   4  1.20  |  mean                      7.0   6  0.45
ah          1938   12.38  6.1   4  0.97  |  total      13186  0.675*
ao          2050   16.79  6.2   4  0.84  |
oy          1689   13.79  5.7   4  0.97  |  Glides         N       P   aL  mP     E
ow          1782   13.15  6.2   4  0.89  |  l           2706    5.17  6.3   5  0.76
uh          1712    7.14  6.9   4  0.81  |  r           2842    6.89  6.1   4  0.99
uw          1689    3.59  6.8   4  0.71  |  w           2114    2.75  6.5   4  0.66
ux          1745    3.02  7.4   5  0.65  |  y           1806    1.01  7.5   7  0.55
er          1871    8.51  6.5   4  0.87  |  hh          1716    0.15  5.7  12  0.45
ax          2209    2.40  6.4   4  0.77  |  hv          1683    0.79  5.9   7  0.79
ix          3285    2.28  6.7   5  0.71  |  el          1713    3.92  6.9   4  0.55
axr         2213    3.33  6.3   4  0.78  |  mean                      6.4   5  0.75
axh         1695    0.05  7.0  15  0.30  |  total      14580 12.034*
mean                      6.4   4  0.86  |
total      41027 81.926*                 |



Table 3.1 provides a comprehensive analysis of the seven phoneme regions and power
distributions of the 1,681-sentence TIMIT test database. It presents the total number of
occurrences  for  all  the  phonemes  (N),  the  total  average  power  occupied  by  each  phoneme 
(P),  total  average  masked  power  (mP)  as  calculated  by  the  AMT  in  section  3.1.2,  and 
the  average  accessory  loudness  (aL)  percentage  as  determined  by  the  ISO-532B  loudness 
analysis.  Table  3.2  presents  a reduced  set  of  the  results  for  the  seven  major  phone  categories 
of  Table  3.1. 

Table 3.2: Relative occurrence (N), total average power (P), average masked power
(mP), average contribution of accessory loudness (aL), and approximation error (E) for
all phoneme categories of the TIMIT test set.

Category        N       P    aL    mP     E
vowels      41027   81.93   6.4   4.0  0.86
nasals      13186    0.68   7.0   6.0  0.45
glides      14580   12.03   6.4   5.0  0.75
stops       28532   1.245   6.6  16.0  0.53
affricates   1069    0.28   6.6  46.0  0.38
fricatives  15745    3.84   6.0  33.0  0.37

Masked  Power 

Table 3.1 presents the total phoneme power distributions in the TIMIT database. Results
indicate that 81.9% of the total power resides in the vowel regions of speech, and vowels
are the most frequently occurring of all phoneme categories. Table 3.1 also provides the
percentage of masked power and the contribution of accessory loudness due to masking.
The AMT curves for all phonemes in the TIMIT test data set were generated as described
in section 3.1.2. The analysis sets the AMT offset based on the overall tonality. Those
components which fall below the threshold are considered inaudible. The total power of these
inaudible components describes the masked vowel power, seen as the spectral distribution
under the curve of Figure 3.6. Figure 3.7 shows one example sentence from TIMIT and
the percentage of power masked over the entire speech utterance. The darkened regions
correspond to vowel regions, which coincide with the regions where the masking percentage is
lowest. Table 3.1 also shows that the masked power level of 4% for vowels, as calculated
by the AMT, is the lowest of all phoneme categories.
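
The masked power percentage described above can be sketched in a few lines of Python. This
is a minimal illustration, assuming the power spectrum and the threshold have already been
brought onto the same frequency grid; the function name and the sample values are
hypothetical and are not taken from the TIMIT analysis. Following the text, a component
that falls below the threshold is counted in full.

```python
def masked_power_percent(power, threshold):
    """Percent of total power carried by components lying below the threshold.

    `power` and `threshold` are equal-length sequences of linear power values,
    one entry per frequency component (an assumption of this sketch).
    """
    total = sum(power)
    if total <= 0.0:
        return 0.0
    masked = sum(p for p, t in zip(power, threshold) if p < t)
    return 100.0 * masked / total


# Hypothetical frame: two strong components and two weak ones under a flat threshold.
frame_power = [10.0, 1.0, 0.5, 8.0]
frame_threshold = [2.0, 2.0, 2.0, 2.0]
print(masked_power_percent(frame_power, frame_threshold))
```

Averaging this value over the frames of a labelled segment, and then over all segments of
a phoneme type, would give a figure comparable in kind to the mP column of Table 3.1.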




Figure 3.7: Percent of masked power in vowel regions (darkened) of the TIMIT speech sentence
“don’t ask me to carry an oily rag like that” (horizontal axis: frame number).

Accessory  loudness 

The masking percentages determined in the previous subsection do not specify how the
masked power contributes to the total loudness. We can assess the contribution of accessory
loudness by inspecting the excitation patterns generated through loudness analysis. The
ISO-532B does not provide a means of calculating masked power, nor is it possible to partition
the analysis to do so. This is why we employed the AMT analysis to evaluate the masked
power. By design, the ISO-532B only allows the calculation of loudness and masked loudness.
It is very difficult, if not inappropriate, to correlate a masking percentage to a loudness value.
Loudness is a function of frequency and amplitude; power is not. The results of this section
present the accessory loudness analysis. We can evaluate the accessory loudness in the
same way as we did for the synthetic vowels. Accessory loudness levels are determined
individually for all the labelled phonemes in the TIMIT database, and averaged
by phoneme type. Figure 3.8 shows the accessory loudness contribution
for a sentence typical of most sentences evaluated in the TIMIT test set. The accessory
loudness is greatest in regions of already occurring maximal loudness, which from the figure
correspond to the vowel regions.


58 


Figure 3.8: Accessory loudness of the TIMIT speech sentence “she had your dark suit in greasy
wash water all year” (vowel regions darkened; horizontal axis: frame number).

Table 3.1 also provides the accessory loudness percentages as determined by the ISO-532B
analysis for all phoneme regions of the entire TIMIT test set. The percentage is the
ratio of accessory loudness in a phoneme region to total loudness in sones in
that phoneme region. It was necessary to state results as a ratio since the loudness of
non-vowel regions is considerably lower, as seen in Figure 3.8. Since the sone scale is a measure
of loudness we can express the accessory loudness contribution as a ratio, or percentage,
of total loudness. This provides a fair comparison between vowel and non-vowel regions.
Interestingly, in the results of Tables 3.1 and 3.2, we do not see a notable difference between
the accessory loudness percentages of the listed phoneme types. Accessory loudness seems
relatively low for all cases.
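
The percentage reported in Table 3.1 can be illustrated with a short Python sketch. It
assumes that the total and main loudness of each labelled segment, in sones, are already
available from a loudness analysis, and that accessory loudness is the difference between
the two. The function names and the sample numbers are hypothetical.

```python
from collections import defaultdict


def accessory_loudness_percent(total_sones, main_sones):
    """Accessory loudness (total minus main) as a percentage of total loudness."""
    if total_sones <= 0.0:
        return 0.0
    return 100.0 * (total_sones - main_sones) / total_sones


def mean_percent_by_phoneme(segments):
    """Average the per-segment percentage for each phoneme label.

    `segments` is an iterable of (label, total_sones, main_sones) tuples.
    """
    buckets = defaultdict(list)
    for label, total, main in segments:
        buckets[label].append(accessory_loudness_percent(total, main))
    return {label: sum(values) / len(values) for label, values in buckets.items()}


# Hypothetical segments: two loud vowel tokens and one quiet fricative token.
example = [("aa", 10.0, 9.5), ("aa", 20.0, 19.0), ("s", 2.0, 1.9)]
print(mean_percent_by_phoneme(example))
```

Because the result is a ratio, a quiet fricative and a loud vowel can report similar
percentages even though their absolute loudness in sones differs greatly.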

We might have expected the fricatives, affricates, and stops to show a higher percentage of
accessory loudness since their masked power percentages were high, and the vowels to show
a lower accessory loudness percentage since they showed less masked power. Interestingly,
this was not the case. When we evaluate the masked power, we
discard the entire signal content below the AMT, not the energy difference between the
threshold and the signal. We represent the discarded power as the masked power. When
we evaluate accessory loudness, we include only the specific loudness difference between
the main excitation and the masked excitation. In the loudness analysis we do not discard
the loudness below the masked excitation, which was the procedure for the masked power
evaluation. The second observation to consider is that the AMT offset is determined by
the signal tonality. A higher tonality means a larger (negative) offset. The AMT is lowered
for more tonal signals. For vowels, this action naturally decreases the total masked power.
The offset due to tonality makes it difficult to establish a direct correlation between masked
power and accessory loudness with these results and the AMT analysis. However, the
investigation of masking level and accessory loudness in the TIMIT test sentence database
primarily demonstrates that vowels 1) have the highest energy in speech, 2) exhibit
moderate masking, 3) provide little accessory loudness, and 4) are the longest in duration
of all the phonemes. These vowel properties will be particularly advantageous in a speech
processing routine which targets vowel regions of speech.

3.2  Identification 

A significant body of evidence has indicated that vowel identification is influenced by
spectral change patterns induced by the consonant environment [59]. These are known as
formant transition cues, and suggest that distinctive features of speech are associated with
the rapid spectrum changes following the transition from a consonant to a vowel [77].
Acoustic cues and phonetic contrasts underlie the intelligibility of speech. Speech enhancement
experiments have shown that amplifying and filtering the consonantal regions of natural
vowel-consonant-vowel (VCV) utterance segments significantly improves intelligibility [52]. The
manipulation of consonant-vowel intensity ratios has a pronounced effect on intelligibility
[18], and experimental studies have additionally suggested that formant amplitude ratios
affect perceived vowel quality, primarily its place of articulation [67]. Studies also reveal that
consonant to vowel transitions play a greater role in speech cognition than consonants or
vowels alone [36]. Certain enhancement techniques which amplify regions of rapid spectral
change and manipulate segment duration have been claimed to be beneficial in speech
training with some language disordered children [43].




We have seen that vowels are characterized by their formant peak amplitudes and
locations. Several studies have investigated the minimum difference in amplitude between
formant peaks and valleys sufficient for vowel identification by normal and hearing impaired
listeners [97, 25, 132]. The general approach in these studies is to broaden the bandwidths of
vowel spectra until normal hearing listeners exhibit the same identification performance
as the hearing impaired listeners. Consequently, the VMP studies describe the extent to
which vowel bandwidths can be widened before vowel identification is sacrificed. In
natural speech, peak to trough differences in amplitude vary with vowel identity, where the
difference can be as large as 25-30dB in front vowels and 5-7dB in other vowels. Reports
indicate that normal hearing listeners can achieve 75% accuracy for differences as low as
1-2dB for certain vowel sounds, which correlate to bandwidth expansions of 2 to 3 times
the original bandwidth [80]. Experimental results also report that “for vowel stimuli with
widened higher frequency formants, vowel identification is unaffected until the first formant
bandwidth is six times wider than normal” [25].

3.2.1 Loudness Adaptation and Auditory Fatigue

Loudness adaptation occurs when the hearing system is exposed to sounds over moderate
or prolonged periods of time. Loudness adaptation typically means the perceptual
loudness level of a sound decreases in response to some form of conditioning by a continuous
sound. Most studies reveal that loudness adaptation is greatest for pure tone sounds, and
is most prominent for fixed-level continuous sounds under 30dB and those near threshold
[55]. The loudness of broadband noise or complex tones, however, remains
relatively constant over time, even near threshold, and well above 30dB [118]. Broadband,
narrowband, and multi-tone complexes adapt much less than tones under most stimulus
conditions [117]. One hypothesis is that loudness adaptation takes place when
excitation is restricted to a narrow region of the cochlea [55]. Loudness adaptation for pure
tones depends on frequency and level, and the adaptation is greater at high frequencies than
at low frequencies. At all frequencies, though, the degree of adaptation decreases as
sensation level increases [55]. Similar studies reveal that tones which increase in level change in
loudness at a slower rate than tones that decrease in level [118]. In essence, these studies
state that broadband spectral distributions exhibit less loudness adaptation than a more
tonal distribution. We can infer that a technique which broadens the spectral bandwidth
will be less susceptible to loudness adaptation in situations where adaptation is evident.

Auditory  fatigue  is  related  to  loudness  adaptation  in  that  fatigue  is  a change  in  auditory 
sensitivity  which  follows  from  the  exposure  to  soft,  medium,  or  loud  sounds.  Anatomical 
observations  of  the  inner  hair  cells  after  exposure  to  tones  of  moderate  to  high  level  and 
the  activity  of  cochlear  nerve  fibers  suggest  that  auditory  fatigue  originates  in  the  cochlea 
[9].  Auditory  fatigue  results  in  an  elevated  loudness  threshold  for  which  the  recovery  time 
depends  on  the  duration,  level,  and  frequency  of  the  sound  exposure.  Auditory  fatigue  can 
be described by the loudness recalibration effect. Recalibration is the phenomenon in which
the  temporal  ordering  of  complex  tones  has  an  effect  on  the  perceived  loudness  [86].  A 
tone  that  is  higher  in  frequency  than  another  may  sound  louder  or  less  loud  depending  on 
the  presentation  order  and  sound  duration.  Such  recalibration  effects  are  termed  loudness 
shifts.  Recalibration  is  similar  to  loudness  adaptation  in  that  the  perception  of  loudness 
is  contingent  on  the  distribution  of  tonal  stimuli  varying  in  frequency  and  level.  Similar 
loudness  shift  effects  have  been  observed  in  forward  and  backward  masking  experiments 
[111]. In these studies, a masker signal preceding a tone burst can increase the perceptual
loudness of the tone. Recall that simultaneous maskers usually attenuate neighboring tonal
components.  The  elevation  under  forward  masking  has  been  hypothesized  as  a result  of 
the  selective  adaptation  of  low  spontaneous  rate  auditory  nerve  fibers  [146].  It  is  also 
suggested  that  these  effects  may  be  caused  by  long  term  loudness  integration.  Studies  prior 
to  the  noise  notch  method  discussed  in  Chapter  2 examined  the  temporal  acuity  of  the 
auditory  system  through  gap  detection  experiments.  Experimental  results  revealed  that 
gap  detection  is  largely  independent  of  the  temporal  position  of  the  gap  within  the  noise. 
The  studies  conclude  that  gap  detection  thresholds  are  relatively  insensitive  to  changes  in 
normal  level,  total  duration,  and  temporal  position  [45].  Auditory  fatigue  demonstrates 
that  the  perceived  loudness  of  an  auditory  event  depends  on  the  way  the  event  sequence  is 
organized.  Such  a description  is  that  of  auditory  continuity  and  suggests  that  loudness  is 
computed subsequent to auditory organization [87]. It is the ability of the auditory system
to segregate, or stream, sound into a coherent representation that allows for the recognition
and association of signals, and in turn for higher levels of interpretation [88].

3.2.2  Formant  Expansion 

Our  review  of  loudness  in  Chapter  2 revealed  that  loudness  is  primarily  a function  of  level 
and  bandwidth,  and  that  traversing  over  critical  bands  increases  the  perception  of  loudness. 
Moore  and  Glasberg  extended  the  description  of  loudness  as  a function  of  excitation,  due  to 
the  nature  of  the  auditory  filters.  The  auditory  filter  bandwidths  increase  with  increasing 
frequency,  and  the  resulting  excitation  pattern  broadens  since  it  represents  the  residual 
outputs  of  the  auditory  filters  as  was  seen  in  Figure  2.8.  Loudness  increases  when  the  total 
specific  loudness  due  to  excitation  increases.  The  loudness  curves  also  demonstrate  that 
we  are  more  sensitive  to  higher  frequencies.  The  decrease  in  sensitivity  above  this  range  is 
due  to  the  outer  to  middle  ear  transform  which  attenuates  our  high  frequency  hearing.  A 
broadband  signal  with  a constant  energy  bandwidth  product  in  the  critical  band  auditory 
filter  range  produces  equal  loudness  [149].  When  the  bandwidth  exceeds  this  critical  band 
the  loudness  increases  even  though  the  intensity  remains  constant  [11]. 

Vowels are high energy and spectrally smooth. In Table 3.2 we saw that vowels in
the TIMIT test database contain roughly 80% of the speech power, and that the majority of
this power is unmasked. The fact that accessory loudness in vowels is roughly 5% indicates
that a loudness enhancement technique which exploits masking properties is
of limited use. Techniques which exploit masking properties, such as MP3, are better
suited for perceptually complex signals such as music, where the emphasis
is on data compression, not energy conservation [105, 3]. Vowels are spectrally smooth
and resonant in nature. We have also seen that formant bandwidths in vowels can be
severely widened without sacrificing identification [25], and that consonants play a more
prominent role in speech intelligibility than vowels [53]. We have also seen that speech
intelligibility is primarily determined by the consonant to vowel transitions and not the
steady state region of vowels [138]. Loudness analysis and listening results also indicate that
the peak loudness, normally assumed to be the perceived loudness, is produced by vowels in
speech [149]. Loudness patterns reveal that consonants and plosives are almost negligible
in terms of loudness. The studies in section 3.2.1 reveal that loudness adaptation decreases as
the spectral distribution is broadened. We have also seen that spectral and temporal dips
are used as acoustic and phonetic cues. By elevating regions of already occurring maximal
loudness we increase the temporal modulation depth. Studies in section 3.2.3 will also show
that an increase in temporal modulation depth can favorably enhance the perception of
loudness.

From these observations, a loudness enhancement technique which preserves energy
would best operate on the vowel regions of speech. In addition, the critical band concept
should be incorporated in the technique. In light of this, a technique which moderately
expands the formant bandwidths in the vowel regions of speech should increase the perception
of loudness without degrading intelligibility. We call this technique the method of formant
expansion. It is a spectral envelope expansion method which operates on the formant
structures of vowels. Figure 3.9 demonstrates the basic principle of the loudness enhancement
technique. The spectrum is slightly flattened and then elevated to the original energy level.
The first step, shown in Figure 3.9a, broadens the bandwidth of the poles. This also reduces
the overall energy level of the signal. The second step, as seen in Figure 3.9b, restores the
original energy level and essentially raises the dilated spectrum. This elevation effectively
broadens the bandwidth, which according to the critical band concept, if wide enough, will
increase perceived loudness.
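
The two steps can be sketched in Python for a single synthetic resonance. This is a minimal
sketch under stated assumptions and not the implementation developed in this work: it uses
the common LPC device of scaling coefficient a_k by gamma^k, which moves every pole of
1/A(z) radially inward by the factor gamma, and it restores energy by matching the energy
of the filter impulse responses. The function names, the value of gamma, and the example
pole are hypothetical.

```python
import math


def expand_bandwidth(a, gamma):
    """Scale the pole radii of 1/A(z) by gamma (0 < gamma <= 1).

    `a` holds the coefficients of A(z) = 1 + a[1] z^-1 + ... with a[0] == 1.
    """
    return [ak * gamma ** k for k, ak in enumerate(a)]


def impulse_response(a, n):
    """First n samples of the impulse response of the all-pole filter 1/A(z)."""
    h = []
    for i in range(n):
        acc = 1.0 if i == 0 else 0.0
        for k in range(1, len(a)):
            if i - k >= 0:
                acc -= a[k] * h[i - k]
        h.append(acc)
    return h


def energy(x):
    """Sum of squared samples."""
    return sum(v * v for v in x)


def energy_restoring_gain(a_original, a_expanded, n=2048):
    """Gain that gives the expanded filter the same impulse-response energy."""
    e_original = energy(impulse_response(a_original, n))
    e_expanded = energy(impulse_response(a_expanded, n))
    return math.sqrt(e_original / e_expanded)


# Hypothetical single formant: pole radius 0.98 at angle pi/4.
r, theta = 0.98, math.pi / 4
a = [1.0, -2.0 * r * math.cos(theta), r * r]
a_wide = expand_bandwidth(a, 0.95)          # step a): broaden the resonance
gain = energy_restoring_gain(a, a_wide)     # step b): restore the energy level
print(a_wide, gain)
```

Broadening lowers the resonance peak and therefore the energy, so the restoring gain is
greater than one; applying it raises the flattened spectrum as a whole.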

Clinical experiments have demonstrated that additive noise on vowels can have the effect
of enhancing formant information for cochlear implant patients [98]. The noise generates
a large frequency distribution over critical band regions which enhances loudness. The
concept of formant expansion is similar in that it spreads the signal energy, but
different in that it does not disrupt the formant fine structure. The general characteristics
of the vowel are preserved but the bandwidth is stretched to enhance the perceived loudness.
There is a compromise between the amount of bandwidth expansion allowable and the tolerance
to intelligibility loss. Studies have shown that vowel bandwidths can be widened up to 3 times
their bandwidth without degrading intelligibility [25]. Preliminary subjective experiments
demonstrate an improved perception of loudness, though with lessened vocal presence. The
method has an effect on the ‘coloration’ or ‘presence’ of the speech, which is addressed in
Chapter 5.

Figure 3.9: Formant bandwidth expansion on synthetic vowel /a/; a) linear bandwidth expansion:
LPC pole displacement broadens bandwidth by reducing formant pole peaks, and b) energy
restoration: elevation of spectrum to restore energy.


3.2.3  Modulation  Depth 

One  of  the  limitations  of  a loudness  model,  such  as  Zwicker’s  or  Moore  and  Glasberg’s, 
is  that  it  does  not  account  for  temporal  fluctuations.  Traditional  loudness  models  are  based 
on  the  average  energy  using  a critical  band  analysis  of  steady  state  sounds.  Most  natural 
sounds  such  as  speech  are  dynamic  stimuli  for  which  the  average  level  does  not  account 
for  the  large  temporal  fluctuation.  Some  studies  reveal  that  for  two  stimuli  of  the  same 
level,  the  stimulus  with  greater  temporal  fluctuations  sometimes  produces  a significantly 
louder  sensation  [147].  These  results  cannot  be  predicted  with  a loudness  model  which 
does  not  incorporate  the  temporal  aspects  of  speech.  It  has  also  been  reported  that  these 
spectral  or  temporal  dips  are  important  for  speech  discrimination  and  intelligibility  [101]. 




People  with  cochlear  hearing  loss  have  a reduced  ability  to  make  use  of  these  cues.  This 
reduced  ability  results  in  a reduced  dynamic  range  for  hearing  impaired  patients  and  leads 
to  the  effect  of  loudness  recruitment  [108].  Since  the  dynamic  range  is  smaller,  and  low  and 
high  level  extremes  are  unchanged,  sounds  become  progressively  louder  within  the  interval 
range.  Studies  have  also  revealed  that  a given  modulation  depth  in  a hearing  impaired 
ear  is  matched  by  a greater  modulation  depth  in  a normal  ear  [95].  It  has  been  suggested 
that  this  effect  of  loudness  recruitment  results  from  the  loss  of  a fast  acting  compressive 
non-linearity.  This  may  be  partly  due  to  the  effects  of  adaptation  which  tend  to  reduce  the 
dynamic  range  of  the  nerve  fiber  responses  [121]. 

Studies  reveal  that  the  loudness  of  a sentence  does  not  correspond  to  an  average  value 
of  the  fluctuating  loudness-time  function,  but  rather  to  a loudness  value  near  the  maximum 
[149].  The  overall  perception  of  loudness  is  time  dependent.  Boosting  individual  regions  to 
increase loudness in less loud regions, such as amplifying consonant regions, may not provide
as drastic an effect as increasing the regions of already occurring maximal loudness.
However, this may also induce an amplitude modulation effect which may disrupt the
natural amplitude balance of speech. Human hearing is based on a relative scale and is
less sensitive to more slowly varying time modulation envelopes [57]. As mentioned in the
preceding paragraph, though, loudness is sometimes enhanced for certain tone complexes
which exhibit larger temporal envelope modulations. Figure 3.10 shows two signals that an
individual subjectively judged to be equally loud. The left hand
signal is a speech sentence and the right hand signal is white noise. Given that
loudness studies correlate the overall level of loudness to a level near the maximal
temporal amplitudes, the formant expansion method should show an elevation of not only
the local loudness, but the overall level of perceived loudness.

Figure 3.10: Perceived equal loudness time functions (sentence and noise).

3.2.4 Synthetic Vowel Loudness

The following analysis examines the change in loudness with respect to the amount of
bandwidth expansion in ten synthetic vowels. The formant bandwidths are adjusted by
changing the pole radii as described by $r_p = e^{-\pi B_p / f_s}$, where $B_p$ is the
formant bandwidth of pole $p$ and $f_s$ is the sampling frequency. The bandwidths are
updated at each iteration $i$ of the loudness analysis as 1) a function of constant
frequency, Eq(3.6), and 2) a function of bandwidth, Eq(3.7). The vowel bandwidths are
increased up to 6 times their original bandwidths given in Figure 3.2.


$$B_p^{(i)} = B_p^{(i-1)} + 100\,\text{Hz} \cdot \text{factor} \tag{3.6}$$

$$B_p^{(i)} = B_p^{(i-1)}\,(1 + \text{factor}) \tag{3.7}$$
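As a rough illustration of the two update rules, the following Python sketch applies one additive step and one multiplicative step to a few bandwidths and converts each result to a pole radius. The helper names, the example bandwidths, the 8 kHz sampling rate, and the use of the isolated-pole relation between bandwidth and radius are assumptions made for illustration; they are not values taken from the loudness analysis.

```python
import math


def additive_update(bandwidth_hz, factor, step_hz=100.0):
    """One step in the style of Eq(3.6): add a fixed step (Hz) scaled by factor."""
    return bandwidth_hz + step_hz * factor


def multiplicative_update(bandwidth_hz, factor):
    """One step in the style of Eq(3.7): scale the bandwidth by (1 + factor)."""
    return bandwidth_hz * (1.0 + factor)


def pole_radius(bandwidth_hz, fs_hz):
    """Radius of an isolated pole with the given bandwidth (assumed relation)."""
    return math.exp(-math.pi * bandwidth_hz / fs_hz)


# Hypothetical formant bandwidths (Hz) and sampling rate, for illustration only.
bandwidths = [60.0, 90.0, 150.0]
fs = 8000.0
factor = 0.5

for b in bandwidths:
    b_add = additive_update(b, factor)
    b_mul = multiplicative_update(b, factor)
    print(b, b_add, pole_radius(b_add, fs), b_mul, pole_radius(b_mul, fs))
```

The additive rule widens every formant by the same number of Hz, while the multiplicative rule widens broad (typically higher) formants more, which is the distinction the loudness comparison relies on.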

Figure  3.11  shows  the  resulting  ISO-532B  loudness  analysis  for  both  expansions.  The 
dotted  line  represents  the  loudness  from  Eq(3.7)  and  the  solid  line  from  Eq(3.6).  The 
upper  two  curves  represent  the  total  loudness  and  the  lower  two  curves  represent  the  main 
loudness. The difference between the two represents the gain due to accessory loudness. In
both cases we see a monotonic increase in total loudness at approximately the same rate
as the unmasked, or main, loudness. We also see that the loudness increases further when
the bandwidth increase is a function of bandwidth. This makes intuitive sense, since
critical bandwidths increase with increasing frequency. These results are characteristic
of all the synthetic vowels examined, which produce curves similar to those in Figure
3.11.

From  these  general  results,  we  can  see  that  increasing  formant  bandwidth  will  increase 
loudness. We also see that increasing the bandwidth on a critical band scale improves the
rate of loudness increase when the bandwidths are excessively broadened. Both methods
for adjusting the bandwidth produce similar loudness results when the bandwidth factor is
less than two. We can assume that the similarity in loudness values below this factor for
both methods is due to the high energy of the first formant. The higher formant bandwidths
contribute more to loudness as the first formant bandwidth decreases.

Figure 3.11: The increase of loudness as a function of vowel bandwidth.

In this chapter we have seen that vowels contain the most energy in speech, are the most
numerous of the phonemes, provide only a small contribution of accessory loudness, have
smooth spectral shapes, and have broad bandwidths that increase with increasing frequency.
External research also reveals that vowel bandwidths can be expanded without affecting
vowel identification. In the next chapter we will examine a filtering technique to
manipulate formant bandwidths such that the perception of loudness can be elevated without
sacrificing vowel identification.


CHAPTER  4 

WARPED  LINEAR  PREDICTION 


Linear prediction analysis is a well-known procedure for modelling acoustical speech
behavior [85]. It relies on the observation that speech is a rather slowly time-varying signal
with  fairly  stationary  characteristics.  Linear  prediction  developed  from  models  of  speech 
production.  It  is  related  to  a speech  production  model  in  that  the  parameters  of  the  speech 
production  model  are  obtained  using  linear  mathematics.  The  linear  speech  production 
model  developed  by  Fant  in  the  late  1950’s  [30]  was  the  most  successful  speech  model  used 
to  describe  the  observed  characteristics  of  human  speech  [85].  The  linear  model  of  speech 
is  briefly  presented  to  review  the  concept  of  linear  prediction. 

The linear model assumes that a glottal excitation source drives a vocal tract model,
whose output then passes through a lip radiation model. The model is represented by the
following equation


$$S(z) = E(z)\,G(z)\,V(z)\,L(z)$$

where  E(z)  represents  the  excitation,  G(z)  the  glottal  shaping,  V(z)  the  vocal  tract  model, 
and  L(z)  the  lip  model.  The  glottal  excitation  is  the  quasi-periodic  pulse  train  of  air 
produced by the vibration of the vocal cords in response to air flow from the lungs. The
glottal  shaping  model  is  of  the  form  [85] 

$$G(z) = \frac{1}{\left(1 - e^{-cT} z^{-1}\right)^{2}}$$

and  the  lip  radiation  model  is  of  the  form  [85] 

$$L(z) = 1 - z^{-1}$$




The vocal tract model is an all-pole model consisting of a cascade of second-order
(two-pole) resonators, where each resonance models a formant of the speech. A three-formant
synthesis model of this type was presented in Section 3.1.1. An all-pole filter can be used
to describe the linear speech production model, and is represented by the following equation

$$G(z)V(z)L(z) = \frac{1}{A(z)} = \frac{1}{1 - \sum_{k=1}^{p} a_k z^{-k}} \tag{4.1}$$

The all-zero filter $A(z)$ is referred to as the inverse filter. It is used in the analysis
model $E(z) = S(z)A(z)$. Its reciprocal $1/A(z)$, as given in Eq(4.1), is referred to as the
all-pole model. It is used in the all-pole speech synthesis $S(z) = E(z)A^{-1}(z)$.
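The analysis and synthesis relations above can be sketched in a few lines of Python. This is a minimal illustration, assuming the sign convention $A(z) = 1 - \sum_k a_k z^{-k}$; the function names and the second-order coefficient set are hypothetical and chosen only so that the filter is stable.

```python
def inverse_filter(s, a):
    """All-zero analysis filter A(z): e(n) = s(n) - sum_k a_k s(n-k)."""
    p = len(a)
    e = []
    for n in range(len(s)):
        pred = sum(a[k - 1] * s[n - k] for k in range(1, p + 1) if n - k >= 0)
        e.append(s[n] - pred)
    return e


def synthesis_filter(e, a):
    """All-pole synthesis filter 1/A(z): s(n) = e(n) + sum_k a_k s(n-k)."""
    p = len(a)
    s = []
    for n in range(len(e)):
        pred = sum(a[k - 1] * s[n - k] for k in range(1, p + 1) if n - k >= 0)
        s.append(e[n] + pred)
    return s


# Hypothetical stable two-pole resonator (pole radius about 0.85).
coeffs = [1.2, -0.72]
excitation = [1.0] + [0.0] * 9            # single impulse
speech = synthesis_filter(excitation, coeffs)
residual = inverse_filter(speech, coeffs)  # recovers the excitation
print(speech[:4])
print(residual[:4])
```

Passing the synthesized signal back through the inverse filter returns the excitation, which is the sense in which the two filters are reciprocals of one another.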


4.1  Linear  Prediction  Model 


Linear prediction of speech rests on the notion that the parameters of the speech
production model vary slowly over time and that, during any particular interval of long
enough duration, the speech waveform can be represented by a linear combination of its
past values. The LPC model is described by

$$s(n) = \sum_{k=1}^{p} a_k\, s(n-k) + G\,u(n) \tag{4.2}$$

where  u(n)  is  the  normalized  excitation  and  G is  the  excitation  gain.  This  leads  to  the 
transfer  function 


$$H(z) = \frac{S(z)}{G\,U(z)} = \frac{1}{1 - \sum_{k=1}^{p} a_k z^{-k}} = \frac{1}{A(z)}$$


The Linear Predictive Coding (LPC) model has been well understood since the early 1970’s
and has been widely used for the following reasons [114]:

1. LPC provides a good approximation to the vocal tract spectrum.

2. It is sufficiently able to resolve and separate the vocal tract model and excitation
source.

3. It is an analytically tractable model.

4. It works well for distortion measures in speech recognition.

The LPC analysis equations provide a means of evaluating the prediction error. The
prediction error is used as a minimization criterion for finding the optimal filter
coefficients $a_k$ which best represent the speech signal in a mean squared error sense. The
prediction error states how close the synthetic representation of the speech $\hat{s}(n)$ is
to the original speech $s(n)$, where $\hat{s}(n)$ is the linear combination of past speech
samples $\hat{s}(n) = a_1 s(n-1) + a_2 s(n-2) + \ldots + a_p s(n-p)$.
The prediction error is defined as


$$e(n) = s(n) - \hat{s}(n) = s(n) - \sum_{k=1}^{p} a_k\, s(n-k) \tag{4.3}$$

which  leads  to  the  error  transfer  function 


$$A(z) = \frac{E(z)}{S(z)} = 1 - \sum_{k=1}^{p} a_k z^{-k} \tag{4.4}$$

When  s(n)  is  actually  generated  by  the  linear  system  in  Eq(4.2)  the  prediction  error  e(n) 
equals  the  scaled  excitation  Gu{n).  The  task  of  linear  prediction  is  to  find  the  set  of 
coefficients  in  Eq(4.3)  which  minimize  the  mean  squared  error.  The  set  of  equations  which 
must  be  solved  to  determine  the  optimal  predictor  coefficients  are  known  as  the  set  of 
normal  equations  and  are  given  by 

$$\phi_n(i, 0) = \sum_{k=1}^{p} a_k\, \phi_n(i, k) \tag{4.5}$$

where $\phi_n(i, k)$ represents the short-term covariance terms of the speech. The
autocorrelation method below is usually employed to solve the set of $p$ equations in $p$ unknowns.

$$\sum_{k=1}^{p} r_n(|i-k|)\, a_k = r_n(i), \qquad 1 \le i \le p \tag{4.6}$$

where $r_n(k)$ is the autocorrelation at lag $k$.
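To make the autocorrelation method concrete, the sketch below forms the short-term autocorrelation of a frame and solves the resulting Toeplitz system by plain Gaussian elimination. The function names are hypothetical, and elimination is used here only for transparency; it is not the solver used in this work.

```python
def autocorrelation(frame, max_lag):
    """Short-term autocorrelation r(0..max_lag) of a finite frame."""
    n = len(frame)
    return [sum(frame[m] * frame[m + k] for m in range(n - k))
            for k in range(max_lag + 1)]


def solve_normal_equations(r, p):
    """Solve sum_k r(|i-k|) a_k = r(i), i = 1..p, by Gaussian elimination."""
    # Augmented matrix: Toeplitz autocorrelation matrix with r(1..p) on the right.
    m = [[r[abs(i - k)] for k in range(p)] + [r[i + 1]] for i in range(p)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda row: abs(m[row][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, p):
            f = m[row][col] / m[col][col]
            for c in range(col, p + 1):
                m[row][c] -= f * m[col][c]
    a = [0.0] * p
    for i in range(p - 1, -1, -1):  # back substitution
        acc = m[i][p] - sum(m[i][c] * a[c] for c in range(i + 1, p))
        a[i] = acc / m[i][i]
    return a


# Autocorrelation of an idealized first-order process, r(k) = 0.5**k.
r_example = [1.0, 0.5, 0.25, 0.125]
print(solve_normal_equations(r_example, 3))
```

For an autocorrelation that decays geometrically, only the first predictor coefficient is non-zero, which gives a quick sanity check on the solver.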

In the autocorrelation method the error sequence $e(n)$, obtained by passing the original
signal through the filter $A(z)$, is a whitened version of the original signal [85]. The
spectral flatness measure describes the level of whitening and is shown to be physically
meaningful with regard to spectral matching. Linear prediction analysis demonstrates the
following: 1) minimization of the mean square value of the prediction error signal is
equivalent to the maximization of the prediction gain, and 2) spectral flatness and minimum
power of the error are considered equivalent criteria. In a following section, we will see
that these two criteria are challenged for warped LPC filters.
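A small numerical sketch may help fix the idea of spectral flatness. It assumes the common definition of the measure as the ratio of the geometric mean to the arithmetic mean of the power spectrum; the function name and the example spectra are hypothetical.

```python
import math


def spectral_flatness(power_spectrum):
    """Geometric mean over arithmetic mean of a strictly positive power spectrum."""
    n = len(power_spectrum)
    geometric = math.exp(sum(math.log(p) for p in power_spectrum) / n)
    arithmetic = sum(power_spectrum) / n
    return geometric / arithmetic


flat = [2.0] * 8                 # perfectly white: flatness of one
peaky = [1.0, 100.0, 1.0, 1.0]   # one dominant resonance: flatness well below one
print(spectral_flatness(flat), spectral_flatness(peaky))
```

Under this definition a perfectly whitened residual has a flatness of one, and any remaining formant structure lowers the value.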

4.2  Bandwidth  Expansion 

An  LPC  technique  used  to  alter  formant  bandwidth  is  given  in  Eq  (4.7). 

$$A(re^{j\omega}) = \sum_{k=0}^{p} a_k\, r^{-k} e^{-j\omega k} \tag{4.7}$$

It provides a way to evaluate the Z transform on a circle with radius $r$ greater than or
less than that of the unit circle, $r = 1$, and is based on the McCandless procedure [89].
A graphical demonstration of the procedure is presented in Figure 4.1. For $0 < r < 1$ the
evaluation is on a circle closer to the poles, so the contribution of the poles is
effectively increased, thus sharpening the pole resonance. Stability is a concern since
$1/A(z)$ is no longer an analytic expression within the unit circle.


Figure 4.1: Pole displacement model used to demonstrate that an evaluation off the unit
circle with $r > 1$ results in a broadened pole response (shaded region).

For $r > 1$ (bandwidth expansion) the evaluation is on a circle farther away from the
poles, and thus the pole resonance peaks decrease and the pole bandwidths are widened.
The poles are always inside the unit circle and $1/A(z)$ is stable. The bandwidth adjustment
technique [85] simply requires a scaling of the LPC coefficients by a power series of $r$,
Eq(4.7). If the poles are well separated [114], the bandwidth can be related to the pole
radius $k$ by

$$B \approx -\ln(k/r)\, f_s / \pi \tag{4.8}$$


where $r = 1.0$ defines evaluation on the unit circle. This follows from an $s$-plane result
that the bandwidth of a pole in radians per second is equal to twice the distance $\sigma$ of
the pole from the $j\omega$-axis, that is $2\sigma$, when the pole is isolated from other
poles [85]. Figure 4.2 shows how strips of width $2\pi$ are mapped from the left-hand $s$
plane to revolutions of the unit circle in the $z$ domain. Thus a pole at a distance
$\sigma_0$ will be mapped to a bandwidth $B = 2\sigma_0 / 2\pi$.


Figure 4.2: Relation of pole distance from $j\omega$-axis to pole bandwidth in Laplace space.

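If the mapping between the $s$ and $z$ domains is given by $z = e^{sT}$, with sampling
period $T = 1/f_s$, then a pole with radius $k$ is related to the bandwidth $B$ as follows.
Writing the pole at a distance $\sigma$ to the left of the $j\omega$-axis,

$$s = -\sigma + j\omega$$
$$z = e^{sT}$$
$$z = e^{-\sigma T} e^{j\omega T}$$
$$k e^{j\theta} = e^{-\sigma T} e^{j\omega T}$$

and setting $\sigma = B\pi$ we get $k = e^{-\pi B / f_s}$, as noted by Eq(4.8). When the
evaluation is on a circle other than the unit circle, the bandwidth increase is related to
the new evaluation circle with radius $r > 1$ by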


$$\Delta B \approx -\left[\ln(k/r) - \ln(k)\right] f_s / \pi$$

$$\Delta B \approx \ln(r)\, f_s / \pi \tag{4.9}$$
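The constant bandwidth increase can be checked numerically on a single two-pole resonator, where the pole radius can be read directly from the second coefficient. The sketch below assumes the convention $A(z) = 1 - a_1 z^{-1} - a_2 z^{-2}$ and scales the coefficients by powers of $1/r$; the function names, the pole placement, and the sampling rate are hypothetical.

```python
import math


def resonator(pole_radius, theta):
    """Predictor coefficients (a_1, a_2) of a two-pole resonator."""
    return [2.0 * pole_radius * math.cos(theta), -pole_radius ** 2]


def scale_coefficients(a, r):
    """Evaluate A(z) on a circle of radius r: a_k -> a_k * r**(-k), k = 1..p."""
    return [ak * r ** (-(k + 1)) for k, ak in enumerate(a)]


def bandwidth_hz(a, fs):
    """Bandwidth of the complex pole pair of a second-order A(z)."""
    radius = math.sqrt(-a[1])
    return -math.log(radius) * fs / math.pi


fs = 8000.0
r = 1.05
original = resonator(0.95, math.pi / 4.0)
expanded = scale_coefficients(original, r)

delta = bandwidth_hz(expanded, fs) - bandwidth_hz(original, fs)
print(delta, math.log(r) * fs / math.pi)
```

The measured increase matches $\ln(r) f_s / \pi$ regardless of the starting pole radius, which is why the plain coefficient-scaling technique widens every formant by the same amount in Hz.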




Interestingly, Eq(4.9) has the same form as Eq(4.8), except that it represents the change
in bandwidth and depends only on the evaluation radius $r$, not on the pole radius $k$. The
bandwidth broadening technique of Eq(4.7) can be put in the filter form of Eq(4.10).


$$H(z) = \frac{A(z/\gamma)}{A(z/\beta)} \tag{4.10}$$


where the bandwidth factors $\gamma$ and $\beta$ set the level of bandwidth adjustment.
Filters of this type have been used as weighting functions for spectral distortion measures
in speech recognition [125], and to sharpen formant bandwidths as a speech enhancement
method to improve speech intelligibility in noise [84]. In the latter case, a bandwidth
factor is chosen to narrow the formant bandwidths, which effectively raises the formant
peaks above the noise. The filter is also common in speech coding, where it has been used as
a compensation filter for the bandwidth underestimation problem [116] and as a postfilter
to enhance the perceived quality of vocoded speech degraded by quantization [15].


4.3  Vocoders 

Vocoders  attempt  to  achieve  the  highest  data  compression  while  maintaining  speech 
signal  integrity.  Low-bit  rate  vocoders  are  specialized  in  achieving  the  best  signal  quality 
compromise  allowable.  The  operations  of  a vocoder  are  defined  to  reproduce  the  processed 
speech  as  closely  as  possible  to  the  original.  Many  sophisticated  audio  processing  systems 
attempt  to  exploit  the  auditory  system’s  natural  masking  phenomena.  Certain  vocoders 
directly incorporate psychoacoustic criteria in their bit compression schemes [69, 70]. In
this section, we investigate the bandwidth expansion technique used in vocoder design. It
is used to alleviate quantization noise and the residual effects of the coding process. Two
of the typically defined internal vocoder filtering operations which employ this technique
are 1) perceptual noise spectral shaping and 2) adaptive post-filtering. Both of these filters
take the same form as Eq(4.10).

The Code Excited Linear Prediction (CELP) algorithm has been shown to achieve
high quality speech coding at low bit rates [33]. It is similar to the family of excited
linear prediction vocoders: VSELP, ACELP, RELP. Excited linear prediction vocoders
incorporate vector quantization (VQ) codebook sequences to reduce the bit-rate coding
requirements. The VQ codebook sequences are hardcoded random noise sequences. These
are the excitation sequences used in the re-synthesis of the speech from the LPC model
parameters. The excitation sequence is selected from an exhaustive search of the entire
codebook. The selected codebook sequence produces the minimum mean squared error
between the input speech signal $s$ and the synthesized signal $\hat{s}_k$, where the MSE
criterion is $E_k = \|s - \hat{s}_k\|^2$. Figure 4.3 shows the basic CELP encoder block
diagram from [33].
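The exhaustive codebook search can be sketched as follows. This is a simplified illustration only: it omits the perceptual weighting filter and the long-term predictor, it assumes a per-codeword least-squares gain, and the codebook size, frame length, and predictor coefficients are hypothetical.

```python
import random


def synthesize(excitation, a):
    """All-pole synthesis 1/A(z) with A(z) = 1 - sum_k a_k z^-k."""
    out = []
    for n, e in enumerate(excitation):
        pred = sum(a[k - 1] * out[n - k]
                   for k in range(1, len(a) + 1) if n - k >= 0)
        out.append(e + pred)
    return out


def search_codebook(target, codebook, a):
    """Return (index, gain, error) of the codeword minimizing the squared error."""
    best = None
    for index, code in enumerate(codebook):
        y = synthesize(code, a)
        energy = sum(v * v for v in y)
        if energy == 0.0:
            continue
        gain = sum(t * v for t, v in zip(target, y)) / energy  # least-squares gain
        error = sum((t - gain * v) ** 2 for t, v in zip(target, y))
        if best is None or error < best[2]:
            best = (index, gain, error)
    return best


rng = random.Random(0)
codebook = [[rng.gauss(0.0, 1.0) for _ in range(40)] for _ in range(16)]
lpc = [1.2, -0.72]  # hypothetical stable predictor

# Build a target frame from a known codeword so the search result can be checked.
target = [2.5 * v for v in synthesize(codebook[3], lpc)]
index, gain, error = search_codebook(target, codebook, lpc)
print(index, gain, error)
```

Every codeword is synthesized and compared against the target, so the cost grows linearly with the codebook size; this is the motivation for the structured codebooks used in practical coders.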


ENCODER:

Figure 4.3: General CELP coder block diagram.


4.3.1  Perceptual  Noise  Weighting 

Almost all vocoders incorporate some form of noise spectral shaping in the coding process,
and a noise compensation technique in the reconstruction [120]. Noise spectral shaping is
a heuristic method to define a noise error function used to weigh the importance of certain
frequency regions [69]. In Figure 4.3 we see the perceptual noise weighting filter in the
dotted box just prior to the selection of the minimum error codebook sequence. This can
be interpreted as a perceptual weighting of the error signal prior to codeword comparisons.
$A(z/\beta)$ represents the bandwidth expansion of $A(z)$. The perceptual weighting filter of
Eq(4.10) is used in the codebook selection process. The weighting effectively generates a
noise error function, since noise in the formant regions is not as serious as noise between
formants.

Figure 4.4 shows the noise error function for a synthetic segment of speech with $\beta = 0.7$.
The noise error function retains the formant pole locations and elevates the allowable noise
error in tonal regions. The perceptual noise weighting filter is applied to the excitation
speech signal, as can be seen in Figure 4.3. The perceptual weighting effectively alters
the flat spectrum of the excitation seen in Figure 4.4c nearer to that of human hearing
sensitivity. Humans capture features from high SNR regions and extrapolate to regions
of low SNR. It is also known that a deceptively high SNR measure can be obtained if
an utterance contains a high concentration of voiced segments, since noise has a greater
perceptual effect in low-energy segments [21].

Figure 4.4: Perceptual noise weighting: a) vocal tract $1/A(z)$, b) coding noise
$A(z/\beta)/A(z)$, and c) excitation. (Horizontal axis: frequency, 0 to 4000.)

4.3.2  Adaptive  Post-filtering 

The standard adaptive post-filter proposed by [15] and given by Eq(4.11) describes
a four-part filtering operation: long-term filter, short-term filter, tilt compensation, and
adaptive gain control.


$$H(z) = G_l\, H_l(z)\, H_s(z)\, H_t(z) = G_l\, \frac{1 + \epsilon z^{-p}}{1 + \lambda z^{-p}} \cdot \frac{A(z/\gamma)}{A(z/\beta)} \cdot (1 - \mu z^{-1}) \tag{4.11}$$


In Eq(4.11), $p$ is the pitch period and the remaining parameters specify the respective
operations of each individual stage. The long-term filter $H_l(z)$ is similar to a comb filter
in that it attenuates frequency components between pitch harmonic peaks. The short-term
filter $H_s(z)$ provides the attenuation between spectral peaks and restores the spectral
characteristics. The tilt compensation $H_t(z)$ adjusts for speech spectral tilt. The gain
adjustment $G_l$ is used to compensate for gain differences between the reconstructed signal
$\hat{s}(n)$ and the post-filtered signal. The automatic gain control operates with the
long-term filter. The adaptive behavior of the post-filter comes from this process.

The  quantization  process  tends  to  underestimate  (reduce)  formant  bandwidths  [131]. 
The  perceptual  weighting  filter  attempts  to  compensate  for  this  bandwidth  underestimation 
problem.  It  provides  for  a small  amount  of  bandwidth  expansion  prior  to  quantization 
[116].  However,  at  low  coding  rates,  noise  spectral  shaping  alone  does  not  completely 
remove  the  coding  noise.  Adaptive  post  filtering  operations  are  additionally  used  to  suppress 
quantization noise in the more sensitive valley regions [33]. Just as the perceptual noise
weighting filter is used to distribute coding noise to higher energy regions such as
formants, the post-filter attempts to suppress quantization noise in the valley regions.
By amplifying the less sensitive formant regions, the perception of quantization noise in the
valleys is reduced. However, most post-filters are based on formant enhancement and not
formant expansion [16, 78, 15]. These post-filtering techniques are usually internal to the
vocoder, though many post-processing filters have been proposed to subjectively improve
speech signal quality [131, 15, 34].


DECODER: 


Figure  4.5:  General  CELP  decoder  block  diagram. 

Quantization generates audible noise in the reconstructed speech. The short-term
post-filter is used to improve the overall quality of the synthesized speech and to
alleviate these effects. Figure 4.5 shows a basic CELP decoder block diagram from [33]. In
some designs only the short-term post-filter, shown in the dotted box, is used. It can again
be noted from Figure 4.5 that the post-filter is in the same general form as Eq(4.10).
Different platforms and speaker acoustics necessitate different values for $\gamma$ and
$\beta$. They are treated as tunable parameters, and in most cases their final values are
left to the discretion of the audio team and are usually determined from Mean Opinion Score
(MOS) listening tests. The $\gamma$ and $\beta$ parameters are usually hardcoded after
evaluation of the subjective listening tests.

Figure 4.6: Response of $1/A(z/\beta)$ for various values of $\beta$. (Horizontal axis:
frequency, 0 to 3000.)

Figure 4.6 shows the short-term filter frequency response for a vocal tract model
$1/A(z/\beta)$ with various values of the bandwidth expansion parameter $\beta$. A 10th-order
filter is usually sufficient for the post-filter [131]. Plots are separated by 10 dB for
clarity. It can be seen that the response flattens as $\beta$ decreases. For voiced speech,
the spectral envelope usually has a low-pass spectral tilt with roughly 6 dB per octave
spectral fall-off. This results from the glottal source low-pass characteristics and the lip
radiation high frequency boost. A $\gamma$ parameter was introduced to adjust for the
spectral tilt [15]. The resulting filter transfer function is that of Eq(4.12).


$$H(z) = \frac{A(z/\gamma)}{A(z/\beta)} \tag{4.12}$$


The numerator effectively adds an equal number of zeros with the same phase angles as the
poles. In effect, the post-filter log-magnitude response is the difference of the two
bandwidth expanded responses seen in Figure 4.6 [15],

$$20\log|H(e^{j\omega})| = 20\log|1/A(e^{j\omega}/\beta)| - 20\log|1/A(e^{j\omega}/\gamma)| \tag{4.13}$$

For $0 < \gamma < \beta < 1$, $20\log|1/A(e^{j\omega}/\gamma)|$ is a very broad response
which resembles the low-pass spectral tilt. Subtraction of this response results in a
formant enhanced spectrum with little spectral tilt. Informal listening tests in [15]
indicate that the low-pass effect is significantly reduced after inclusion of the numerator
term, though the speech remains slightly muffled. A pre-emphasis filter $H_t(z)$ is
sometimes included to alleviate this low-pass effect.
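The subtraction of the two bandwidth expanded responses can be verified numerically. The following sketch evaluates the pole-zero filter $A(z/\gamma)/A(z/\beta)$ on the unit circle and compares its level in dB with the difference of the two all-pole envelopes. It assumes the convention $A(z) = 1 - \sum_k a_k z^{-k}$; the function names, coefficients, and parameter values are hypothetical.

```python
import cmath
import math


def a_response(a, omega, scale=1.0):
    """A(z/scale) evaluated at z = e^{j omega}; (z/scale)^-k = scale^k z^-k."""
    return 1.0 - sum(ak * scale ** k * cmath.exp(-1j * omega * k)
                     for k, ak in enumerate(a, start=1))


def envelope_db(a, omega, scale):
    """20 log10 |1 / A(e^{j omega} / scale)|: a bandwidth expanded envelope."""
    return -20.0 * math.log10(abs(a_response(a, omega, scale)))


def postfilter_db(a, omega, gamma, beta):
    """20 log10 |A(z/gamma) / A(z/beta)| on the unit circle."""
    h = a_response(a, omega, gamma) / a_response(a, omega, beta)
    return 20.0 * math.log10(abs(h))


lpc = [1.2, -0.72]       # hypothetical two-pole vocal tract model
gamma, beta = 0.5, 0.8   # 0 < gamma < beta < 1: formant enhancement

for omega in (0.3, 1.0, 2.5):
    direct = postfilter_db(lpc, omega, gamma, beta)
    difference = envelope_db(lpc, omega, beta) - envelope_db(lpc, omega, gamma)
    print(omega, direct, difference)
```

Swapping the two parameters, so that $\beta < \gamma$, negates the response in dB and turns the same structure into a bandwidth expansion filter.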

It should be noted that Eq(4.12) describes a filter which can be used either as a formant
enhancement filter, i.e., one which sharpens formant peaks, or as a bandwidth expansion
filter, i.e., one which dilates formant bandwidths. The selection depends on the values of
the $\gamma$ and $\beta$ parameters. For $0 < \gamma < \beta < 1$, the filter provides a
sharpening of the formants, or a narrowing of the formant bandwidth. This is evident from
Figure 4.6 and Eq(4.13), since the response of $|H(z)|$ is similar to that of $1/|A(z)|$
without the tilt. Such a response would be similar to the dotted line response of Figure
4.3b. For $0 < \beta < \gamma < 1$, the filter is a bandwidth expansion filter. Such a filter
response would be the reciprocal of Figure 4.3a, where the formant sidelobes are amplified
in greater proportion than the formant peaks.

The vocoder perceptual noise weighting and postfilter are important considerations
for this work, since many of the operations in vocoders are designed specifically for
low-complexity speech processing applications. The bandwidth expansion techniques are
excellent candidates for a loudness enhancement method. Loudness increases as energy extends
beyond a critical bandwidth, and our initiative is to bandwidth expand vowel regions of
speech to elevate the perceptual loudness.

4.4  Warped  Filtering 

In  this  section  a warped  filter  with  bandwidth  expansion  is  presented  as  a new  speech 
enhancement  method  to  adjust  formant  bandwidths  on  a critical  band  scale.  The  LPC  pole 
enhancement  technique  is  applied  in  the  warped  frequency  domain  to  accomplish  this  task. 
The  LPC  pole  enhancement  technique  provides  only  a fixed  bandwidth  increase  independent 
of  the  frequency  of  the  formant.  In  a Warped  LPC  filter  (WLPC),  the  all-pass  warping 
factor  can  provide  an  additional  degree  of  freedom  for  bandwidth  adjustment. 

Eq(4.7) showed how the Z transform can be evaluated on a circle of radius $r$ given the
LP coefficients. This operation is a function of the radius and determines the amount of
bandwidth change, as given in Eq(4.8), which is constant for well separated poles. The
elegance of the technique is that the coefficients can be scaled directly. It would be
desirable to have this same technique work on a critical band scale, and expand the
bandwidths of each pole on a scale closer to that of the human auditory system. In chapter
3, we saw that a multiplicative bandwidth increase is better than an additive increase on
synthetic vowels. Our intent in this filter design is to increase the overall perception of
loudness without adding signal energy. Expanding beyond critical bands increases the
perception of loudness. It would be useful to expand the pole bandwidths on a critical band
scale, or at least have a way to non-linearly adjust bandwidths in a predetermined manner.
Such a solution is possible using warped LPC filter design. Warped filters have primarily
been used for audio modelling. They require a lower order than a general FIR or IIR filter
for auditory modelling, since they are able to distribute their poles in accordance with
the frequency scale. Since warped filter structures are realizable, the linear bandwidth
expansion technique can be used in this transformed space to achieve nonlinear bandwidth
expansion.
Warping  refers  to  alteration  of  the  frequency  scale.  Conceptually  it  can  be  considered 
as  a stretching  or  compressing  of  the  spectral  envelope.  The  idea  of  a warped  frequency 
scale  FFT  was  originally  proposed  by  Oppenheim  [102].  It  has  been  applied  to  audio 
coding,  audio  filter  design  [4],  instrument  modelling,  and  predictive  coding  [70].  Warped 
linear  predictive  coding  has  also  recently  been  attractive  in  perceptual  vocoder  design  and 
high  fidelity  audio  [48].  The  warping  characteristics  allow  a spectral  representation  which 
closely  approximates  the  frequency  selectivity  of  human  hearing.  It  also  allows  lower  order 
filter  designs  to  better  follow  the  non-linear  frequency  resolution  of  the  peripheral  auditory 
system. 

A warping transformation is a functional mapping of a complex variable. For warped
filters the mapping function is in the $z$ domain, and must provide a one-to-one mapping
of the unit circle onto itself. The pair of transformations is between the $z$ domain
and the warped $\tilde{z}$ domain

$$\tilde{z} = g(z)$$

$$z = f(\tilde{z}) \tag{4.14}$$




In the design of a warped filter, the functional transformations must have an inverse
mapping $\tilde{z} = g\{f(\tilde{z})\}$. It must be possible to return to the original $z$
domain. The bilinear transform is one such mapping which satisfies the requirements of
being one-to-one and invertible [103]. The bilinear transform maps the entire imaginary
axis of the $s$ plane onto the unit circle of the $z$ domain. It has been used in
analog-to-digital filter designs to map piece-wise magnitude responses from the $s$ plane to
the $z$ plane. It has also been used for discrete-time frequency filter transformations to
convert low-pass prototypes to high-pass and band-pass filter types [133]. The bilinear
transform corresponds to the first order all-pass filter


$$\tilde{z}^{-1} = \frac{z^{-1} - a}{1 - a z^{-1}}, \qquad -1 < a < 1 \tag{4.15}$$


The all-pass has a frequency response magnitude independent of frequency and passes all
frequencies with unity magnitude. All-pass systems can be used to compensate for group
delay distortions or to form minimum phase systems. In the case of warped filters, their
predetermined ability to distort the phase is used to favorably alter the effective
frequency scale. The parameter $a$ of the dispersive delay element sets the degree of
frequency warping. A negative value of $a$ provides the inversion. Digital filters typically
operate on a uniform frequency scale since they are a multiple composition of unit delay
elements. The unit delays by definition are independent of frequency. The dispersive
elements inject frequency dependence and result in non-uniform frequency resolution.
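The frequency dependence introduced by the dispersive element can be seen by evaluating the first-order all-pass on the unit circle and reading the warped frequency from its phase. The sketch below assumes the all-pass form $(z^{-1} - a)/(1 - a z^{-1})$; the function names are hypothetical, and the closed-form expression used for comparison is the standard phase of this all-pass rather than a formula quoted from this chapter.

```python
import cmath
import math


def allpass(z_inv, a):
    """First-order all-pass element: (z^-1 - a) / (1 - a z^-1)."""
    return (z_inv - a) / (1.0 - a * z_inv)


def warped_frequency(omega, a):
    """Warped angular frequency: negated phase of the all-pass at e^{j omega}."""
    d = allpass(cmath.exp(-1j * omega), a)
    return -cmath.phase(d) % (2.0 * math.pi)


def warped_frequency_closed_form(omega, a):
    """Standard closed form of the all-pass phase, valid for 0 <= omega <= pi."""
    return omega + 2.0 * math.atan(a * math.sin(omega) / (1.0 - a * math.cos(omega)))


a = 0.47  # warping factor used for the example spectrum in this chapter
for omega in (0.1, 0.5, 1.0, 2.0, 3.0):
    print(omega, warped_frequency(omega, a), abs(allpass(cmath.exp(-1j * omega), a)))
```

For positive $a$ the warped frequency exceeds the original frequency everywhere between 0 and $\pi$, so low frequencies occupy a larger share of the warped axis, while the magnitude stays at unity.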

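The warped filter design is established through the mappings of Eq(4.14) [75]. The
$\tilde{z}$ transform $\tilde{H}(\tilde{z})$ of the warped impulse response $\tilde{h}(n)$
must equal the $z$ transform $H(z)$ of the impulse response $h(n)$. And, since the transform
is invertible, $H(z)$ must equal $\tilde{H}(\tilde{z})$.

$$H(z) = \sum_{n=0}^{\infty} h(n)\, z^{-n}$$

$$\tilde{H}(\tilde{z}) = \sum_{n=0}^{\infty} \tilde{h}(n)\, \tilde{z}^{-n}$$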

From this, any sequence $s(n)$, where $n$ denotes time, provides a corresponding frequency
warped sequence $\tilde{s}(n)$ given by

$$\sum_{n=0}^{\infty} \tilde{s}(n)\, \tilde{z}^{-n} = \sum_{n=0}^{\infty} s(n)\, z^{-n} \tag{4.16}$$




In a frequency warping transformation, the convolution of two sequences is transformed into
the convolution of the transformed sequences [130]. On the left of Eq(4.16) we see a
transformation of $s$ and a transformation of $z$, denoted by the tilde signs. $\tilde{z}$ is
the substitution of the all-pass elements, and the $\tilde{s}$ sequence is a linear
transformation of $s$ that sets the two sides equal. If we replace the unit delays with
all-pass elements, an efficient recursion is available which can be applied to the
autocorrelation sequence $R_n$, power spectrum $P_n$, prediction parameters $a_p$, or
cepstral parameters. We used the Oppenheim recursion [110] on the autocorrelation
sequence for our frequency warping transformation.

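$$\begin{aligned}
&\text{For } 0 \le n \le p\ \{ \\
&\qquad \tilde{r}_0^{(n)} = a\left[\tilde{r}_0^{(n-1)}\right] + r_{p-n} \\
&\qquad \tilde{r}_1^{(n)} = a\left[\tilde{r}_1^{(n-1)}\right] + (1 - a^2)\,\tilde{r}_0^{(n-1)} \\
&\qquad \text{For } 2 \le k \le p\ \{ \\
&\qquad\qquad \tilde{r}_k^{(n)} = a\left[\tilde{r}_k^{(n-1)} - \tilde{r}_{k-1}^{(n)}\right] + \tilde{r}_{k-1}^{(n-1)}\ \}\}
\end{aligned} \tag{4.17}$$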

The transform holds only for a causal sequence. Since the autocorrelation is even, we can
split it up into the sequence $r_0/2, r_1, r_2, \ldots, r_n$, where $r$ are the
autocorrelation values. After the recursion, $\tilde{r}_0$ has to be doubled. This is the
warped autocorrelation method and returns a warped autocorrelation sequence
$\tilde{R}_k = \tilde{r}_k^{(p)}$. This method operates directly from the time
sampled autocorrelation function [130].
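A compact way to see how such a recursion operates is to write it out in code. The sketch below follows the commonly published form of the all-pass frequency transformation recursion for a causal sequence, with the halving and doubling of the zero-lag term described above. It is an illustrative reading rather than a transcription of the implementation used in this work, and the function names and the example autocorrelation values are hypothetical.

```python
def warp_sequence(c, a, n_out):
    """Frequency-warp the causal sequence c with all-pass parameter a.

    Returns n_out coefficients g such that sum_n c[n] z^-n equals
    sum_m g[m] zt^-m, where zt^-1 = (z^-1 - a) / (1 - a z^-1).
    """
    g = [0.0] * n_out
    for x in reversed(c):      # process the input from its last sample to its first
        d = g[:]               # coefficients from the previous iteration
        g[0] = x + a * d[0]
        if n_out > 1:
            g[1] = (1.0 - a * a) * d[0] + a * d[1]
        for j in range(2, n_out):
            g[j] = d[j - 1] + a * (d[j] - g[j - 1])
    return g


def warped_autocorrelation(r, a):
    """Warp an even autocorrelation given by its non-negative lags r[0..p]."""
    causal = [r[0] / 2.0] + list(r[1:])   # split the zero lag between both halves
    g = warp_sequence(causal, a, len(r))
    g[0] *= 2.0                           # restore the full zero-lag value
    return g


r_example = [1.0, 0.8, 0.5, 0.2]          # hypothetical autocorrelation values
print(warped_autocorrelation(r_example, 0.47))
print(warp_sequence([0.0, 1.0], 0.5, 4))  # expansion of a single unit delay
```

With $a = 0$ the recursion returns its input unchanged, and warping a single unit delay reproduces the power series of the all-pass substitution, which gives two simple checks on the implementation.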

Another method obtains the warped autocorrelation by Fourier transforming a warped
power spectrum. The power spectrum is warped by the cosine-series summation [130]. The
power spectral warping method of Eq(4.18) was additionally used for comparison. Similar
warping characteristics were obtained. In Eq(4.18), $\tilde{\omega}$ and $d\tilde{\omega}$
are the warping functions presented ahead in Eq(4.23) and Eq(4.24)

$$\tilde{r}_k = \sum_{n=0}^{L/2} u_{kn} P_n, \qquad 0 \le k \le p \tag{4.18}$$

$$u_{kn} = \left(d\tilde{\omega} / d\omega\right)_n \cos(k\,\tilde{\omega}_n)$$

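After the warped autocorrelation function $\tilde{r}_k$ is determined, the warped prediction
coefficients $\tilde{a}_k$ are obtained by solving the normal set of equations. This can be
done with the Durbin algorithm, Eq(4.19) [114],

$$\begin{aligned}
E^{(0)} &= \tilde{r}_0 \\
k_i &= \left[\tilde{r}_i - \sum_{j=1}^{i-1} \alpha_j^{(i-1)}\, \tilde{r}_{i-j}\right] \Big/ E^{(i-1)} \\
\alpha_i^{(i)} &= k_i \\
\alpha_j^{(i)} &= \alpha_j^{(i-1)} - k_i\, \alpha_{i-j}^{(i-1)}, \qquad 1 \le j \le i-1 \\
E^{(i)} &= (1 - k_i^2)\, E^{(i-1)}, \qquad 1 \le i \le p \\
\tilde{a}_k &= \alpha_k^{(p)}, \qquad 1 \le k \le p
\end{aligned} \tag{4.19}$$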
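The Durbin recursion is short enough to state directly in code. The following Python sketch follows the standard textbook form of the algorithm and applies equally to a conventional or a warped autocorrelation sequence; the function name and the example autocorrelation values are hypothetical.

```python
def levinson_durbin(r, p):
    """Solve the autocorrelation normal equations for a_1..a_p.

    Returns (a, k, err): predictor coefficients, reflection coefficients,
    and the final prediction error energy.
    """
    a = [0.0] * (p + 1)   # a[0] is unused so that indices match the equations
    k = [0.0] * (p + 1)
    err = r[0]
    for i in range(1, p + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k[i] = acc / err
        updated = a[:]
        updated[i] = k[i]
        for j in range(1, i):
            updated[j] = a[j] - k[i] * a[i - j]
        a = updated
        err *= (1.0 - k[i] ** 2)
    return a[1:], k[1:], err


r_example = [1.0, 0.8, 0.5, 0.2]   # hypothetical positive-definite sequence
coeffs, reflection, final_error = levinson_durbin(r_example, 3)
print(coeffs, reflection, final_error)
```

The recursion needs on the order of $p^2$ operations, and reflection coefficients with magnitude below one indicate a stable all-pole model.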

Figure  4.7  shows  the  resulting  frequency  response  of  the  Oppenheim  recursion  as  applied 
to  a synthetic  speech  segment  with  a = 0.47.  It  can  be  seen  that  the  method  stretches 
the  envelope  rightwards.  Critical  bandwidths  increase  with  increasing  frequency.  Since  the 
warped  spectrum  is  on  a critical  band  scale,  the  large-bandwidth,  high-frequency  regions 
of  the  original  spectrum  become  compressed,  and  effectively  result  in  a warped  spectrum 
stretched  towards  the  right.  For  0 < a < 1 frequency  warping  stretches  the  low  frequencies 
and  compresses  the  high  frequencies.  For  — 1 < a < 0 frequency  warping  compresses  the 
low  frequencies  and  stretches  the  high  frequencies. 


Figure  4.7:  Critical  band  frequency  warping  using  Oppenheim  recursion  on  autocorrelation 
sequence. 




4.5  Warped  Filter  Structures 

In this section we derive the analytic relation between the warped LPC coefficients
and the modified filter coefficients used in the final filter structure. A coefficient conversion
is necessary for stability purposes in the recursive filter section, as will be discussed.
The conversion also allows the filter structure to retain the Direct Form II implementation.
For illustration we perform element analysis on a first-order filter section and extend this
to an $n$th-order filter. Also, for simplicity we replace the notation of the warped prediction
coefficients $\tilde{a}$ with $a$. It should be kept in mind that all references to $a$ now imply reference
to the warped coefficients $\tilde{a}$. This section serves to show the derivation of the modified filter
coefficients from any coefficient set $a$.

4.5.1  Analysis  Filter 

The analysis filter is referred to as the inverse filter. It is the all-zero inverse of the
all-pole speech model. The prediction coefficients $a_k$ define the prediction error (analysis)
filter given by

$$A(z) = \sum_{k=0}^{p} a_k z^{-k} \tag{4.20}$$

which represents a conventional FIR filter when normalized so that $a_0 = 1$. Figure 4.8 shows
a one-tap delay line with filter coefficients $a_0$ and $a_1$. This is a first-order analysis filter and
will be used to demonstrate the warping characteristics of an all-pass element. We directly
replace the unit delay operator of the linear phase filter with an all-pass element as shown
in Figure 4.9. The all-pass element is within the dotted box. Higher order models similarly
replace each of the unit delays with all-pass elements. Nodal analysis of the all-pass element
gives the transfer characteristics.

Figure 4.8: Analysis filter element.

Figure 4.9: Unit delay replacement with all-pass.


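$$
\begin{aligned}
X(z) &= Y(z)z^{-1} - S(z) \\
Y(z) &= S(z)z^{-1} + \alpha X(z) \\
\frac{Y(z)}{S(z)} &= \frac{z^{-1} - \alpha}{1 - \alpha z^{-1}}
\end{aligned}
\tag{4.21}
$$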


Eq(4.21)  is  the  standard  all-pass  transfer  function  as  noted  in  Eq(4.15),  and  is  typically  put 
in  the  general  substitution  form  of  Eq(4.22) 


$$\tilde{z}^{-1} = \frac{z^{-1} - \alpha}{1 - \alpha z^{-1}}, \qquad -1 < \alpha < 1 \tag{4.22}$$


The feedback term $\alpha$ provides a time dispersive element that produces the warping
characteristics. By definition, the all-pass element passes all signals with equal magnitude,
so the warping is carried entirely by the phase. We can evaluate the warping characteristics
by solving for the phase response. Setting $z = e^{j\omega}$,

$$e^{-j\tilde{\omega}} = e^{-j\omega}\,\frac{1 - \alpha e^{j\omega}}{1 - \alpha e^{-j\omega}}$$

and solving for the phase $\tilde{\omega}$ in terms of $\omega$, we get

$$
\begin{aligned}
-\tilde{\omega} &= -\omega + \arg[1 - \alpha e^{+j\omega}] - \arg[1 - \alpha e^{-j\omega}] \\
\tilde{\omega} &= \omega + 2\arctan\left(\frac{\alpha\sin\omega}{1 - \alpha\cos\omega}\right)
\end{aligned}
\tag{4.23}
$$


Eq(4.23) gives the phase characteristics of the all-pass element, where $\alpha$ sets the level
of frequency warping. An alternate derivation provided in Appendix B yields an equivalent
representation of the phase characteristics [103, 133].

$$\omega = \arctan\left(\frac{(1 - \alpha^2)\sin\tilde{\omega}}{(\alpha^2 + 1)\cos\tilde{\omega} + 2\alpha}\right)$$


Figure 4.10 shows the different warping characteristics set by $\alpha$. For $\alpha > 0$ low
frequencies are expanded and high frequencies are compressed. For $\alpha < 0$ high frequencies
are expanded and low frequencies are compressed. When $\alpha = 0$ there is no warping and the
filter structure reduces to that of a general FIR filter seen in Figure 4.8.


Figure 4.10: Frequency warping characteristics of the all-pass element described by Eq(4.23)
for different values of the warping factor $-0.8 < \alpha < 0.8$.




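The derivative of Eq(4.23), given in Eq(4.24), states the local level of compression or
expansion [26]. A value greater than one implies that the frequency axis is locally expanded,
and a value less than one implies that it is locally compressed.

$$\frac{d\tilde{\omega}}{d\omega} = \frac{1 - \alpha^2}{1 + \alpha^2 - 2\alpha\cos\omega} \tag{4.24}$$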
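The warping map of Eq(4.23) and its slope in Eq(4.24) are easy to check numerically. The sketch below is illustrative only; the function names are assumptions introduced for the example. It compares the closed-form warped frequency with the phase of the all-pass element evaluated on the unit circle, and locates the frequency at which the slope equals one, which follows from setting Eq(4.24) to unity and gives $\cos\omega = \alpha$.

```python
import cmath
import math


def warp(omega, alpha):
    """Warped frequency of Eq(4.23), in radians per sample."""
    return omega + 2.0 * math.atan2(alpha * math.sin(omega),
                                    1.0 - alpha * math.cos(omega))


def allpass_phase(omega, alpha):
    """Negative phase of the all-pass of Eq(4.22) at z = exp(j*omega)."""
    z_inv = cmath.exp(-1j * omega)
    h = (z_inv - alpha) / (1.0 - alpha * z_inv)
    return -cmath.phase(h)


def warp_slope(omega, alpha):
    """Derivative of the warped frequency with respect to omega, Eq(4.24)."""
    return (1.0 - alpha ** 2) / (1.0 + alpha ** 2 - 2.0 * alpha * math.cos(omega))


def unit_slope_frequency(alpha):
    """Frequency where the slope equals one, i.e. cos(omega) = alpha."""
    return math.acos(alpha)
```

With a positive warping factor the slope is $(1+\alpha)/(1-\alpha)$ at $\omega = 0$ and $(1-\alpha)/(1+\alpha)$ at $\omega = \pi$, which is the low-frequency expansion and high-frequency compression described above.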

Figure 4.11 shows the frequency warping characteristics of the all-pass system given by
Eq(4.23) with $\alpha = 0.47$ compared to the analytic formula of the critical band scale proposed
by Zwicker in Eq(2.3). For a sampling frequency of 10 kHz, $\alpha = 0.47$ provides a very good
approximation to the critical band scale, and is a standard approximation used by many
researchers [130]. $\alpha$ is positive for critical band warping and depends on the sampling
frequency $f_s$ (in Hz) by the following equation [48],

$$\alpha = 1.0674\left[\frac{2}{\pi}\arctan\left(\frac{0.06583\,f_s}{1000}\right)\right]^{1/2} - 0.1916 \tag{4.25}$$

Figure 4.11: Frequency warping characteristics of all-pass (dotted, from Eq(4.23)) compared
to critical band scale (solid, from Eq(2.3)).
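As a quick illustration of Eq(4.25), the warping factor can be computed directly from the sampling frequency. This is a minimal sketch; the function name `warping_factor` is an assumption made for the example, and the sampling frequency is taken in Hz.

```python
import math


def warping_factor(fs_hz):
    """Critical band warping factor of Eq(4.25) for a sampling frequency in Hz."""
    inner = (2.0 / math.pi) * math.atan(0.06583 * fs_hz / 1000.0)
    return 1.0674 * math.sqrt(inner) - 0.1916


alpha_10k = warping_factor(10000.0)
```

At a sampling frequency of 10 kHz this evaluates to roughly 0.46, close to the value of 0.47 quoted above, and the factor grows with the sampling frequency.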


Continuing with the nodal analysis of Figure 4.9, we solve for the error $e(n)$ to give us
the transfer function of the 1st-order filter section.

$$\frac{E(z)}{S(z)} = a_0 + a_1\,\frac{z^{-1} - \alpha}{1 - \alpha z^{-1}} \tag{4.26}$$


We see that the response is the standard FIR response in the form of a linear combiner
whose delay operator is an all-pass filter. This is also the description of a Laguerre filter, which
is closely related to the warped FIR transversal filter [91]. Warped filter designs based
on the Short Time Laguerre Transform (STLT) have recently been proposed for real-time
implementation of audio frequency warping [28, 29]. For a $p$th-order filter of the same
form, we can put this in the standard notation where $\tilde{z}$ represents the all-pass element.
$A(\tilde{z})$ represents the all-zero, or prediction error, filter.

$$A(\tilde{z}) = \sum_{k=0}^{p} a_k \tilde{z}^{-k}, \qquad a_0 = 1 \tag{4.27}$$

The  analysis  demonstrates  the  direct  substitution  of  an  all-pass  filter  into  the  unit  delay. 
This  is  a straightforward  substitution  for  the  FIR  (analysis)  form  of  any  order.  Figure  4.12 
shows  the  direct  form  of  the  prediction  error  filter  for  p = 3 with  unit  delays  replaced  by 
all-pass  elements.  In  a warped  recursive  filter,  we  will  see  the  all-pass  delay  for  the  synthesis 
filter  is  not  a simple  substitution. 
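The direct substitution for the FIR form can be sketched as a chain of first-order all-pass sections feeding a linear combiner. The code below is an illustrative Python sketch under the assumption that each section follows the difference equation $g_k(n) = g_{k-1}(n-1) + \alpha\,(g_k(n-1) - g_{k-1}(n))$ with $g_0(n) = s(n)$; the function name `warped_fir` is hypothetical and the delay registers start at rest.

```python
def warped_fir(s, a, alpha):
    """Filter s with A(z~) = sum_k a[k] z~^{-k}, using all-pass delays."""
    p = len(a) - 1
    g_prev = [0.0] * (p + 1)     # section outputs at time n-1
    e = []
    for sample in s:
        g = [0.0] * (p + 1)
        g[0] = sample            # input to the first all-pass section
        for k in range(1, p + 1):
            # first-order all-pass: output depends on the current input,
            # the previous input, and the previous output
            g[k] = g_prev[k - 1] + alpha * (g_prev[k] - g[k - 1])
        e.append(sum(ak * gk for ak, gk in zip(a, g)))
        g_prev = g
    return e


# With alpha = 0 the structure reduces to an ordinary tapped delay line
impulse_response = warped_fir([1.0, 0.0, 0.0, 0.0], [1.0, 0.5, 0.25], 0.0)
```

Because every section passes part of its current input straight through, the first output sample for an impulse equals $\sum_k a_k(-\alpha)^k$ rather than $a_0$. This delay-free path is harmless in the FIR form but is the source of the realizability problem in the recursive form.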


Figure  4.12:  Direct  substitution  of  all-pass  elements  in  FIR. 




4.5.2  Synthesis  Filter 


The synthesis filter is a recursive filter with the IIR form described by $A^{-1}(z)$ and shown
in Figure 4.13

$$A^{-1}(z) = \frac{1}{\sum_{k=0}^{p} a_k z^{-k}} \tag{4.28}$$


However, the synthesis filter with the insertion of an all-pass element is not a straightforward
unit delay replacement. The substitution of all-passes into the unit delay of the recursive
IIR form creates a lag-free term in the delay feedback loop. If we replace the unit delay
operator $z^{-1}$ with the all-pass we see this delay-free term.

$$\tilde{z}^{-1} = \frac{G(z)}{S(z)} = \frac{z^{-1} - \alpha}{1 - \alpha z^{-1}} = (1 - \alpha^2)\,\frac{z^{-1}}{1 - \alpha z^{-1}} - \alpha$$

Figure 4.13: Synthesis filter element.


The lag-free term $\alpha$ must be incorporated into a delay structure which lags all terms equally
to be realizable [130]. The problem with a delay-free term is that it can propagate through
the feedback path and form an unrealizable time dependency. We can see this through a
direct implementation of the all-pass,

$$g(n) = s(n-1) + \alpha\,(g(n-1) - s(n))$$
$$s(n) = a_0\,(e(n) - a_1 g(n))$$




We see that $s(n)$ depends on $g(n)$, which in turn depends on $s(n)$. The present value
therefore depends on itself before it has been calculated. Realizable warped recursive
filter designs have been proposed to remedy this problem and elegant solutions exist [48, 74].
One solution is a delay splitting procedure, which allows the filter to retain its original structure
without modification of the filter coefficients [49].

In this study the difference equation of the filter is manipulated to yield a new filter
structure with a new set of coefficients. This method for realization of the warped IIR form
requires the all-pass delay elements to be replaced with first-order low-pass sections. The
filter structure will be stable if the warping is moderate and the filter order $p$ is low [74].
Eq(4.27) can be expressed as a polynomial in $z^{-1}/(1 - \alpha z^{-1})$ with the substitution of $\tilde{z}$ in
Eq(4.29). This allows a mapping of the prediction coefficients $a_k$ to a coefficient set $b_k$ used
directly in a standard recursive filter structure [130]. The derivation is given in Appendix
A. In this manner the all-pass lag-free element is removed from the open loop gain and a
realizable warped IIR filter for $A^{-1}(\tilde{z})$ is possible.

$$A(\tilde{z}) = \sum_{k=0}^{p} b_k \left(\frac{z^{-1}}{1 - \alpha z^{-1}}\right)^{k} \tag{4.29}$$


The $b_k$ coefficients are generated by a linear transformation of the warped LPC coefficients
$a_k$, using the binomial equations [130],

$$b_k = \sum_{n=k}^{p} C_{kn}\,a_n, \qquad C_{kn} = \binom{n}{k}(1 - \alpha^2)^{k}(-\alpha)^{n-k} \tag{4.30}$$


Figure 4.14 shows the resulting filter structure from replacing the unit delay in Figure 4.13
with an all-pass, and transforming the coefficients by Eq(4.30). The binomial equations
are equivalent to multiplication by a fixed triangular matrix with entries $C_{kn}$, and the recursive
algorithm is provided in Appendix A.
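The coefficient mapping of Eq(4.30) can be verified numerically by evaluating both polynomial forms at an arbitrary point on the unit circle. The following Python sketch is illustrative only; the function names are assumptions for the example, and the test coefficients are arbitrary values rather than data from this study.

```python
import cmath
from math import comb


def warped_to_direct(a, alpha):
    """Map coefficients a_n of A(z~) to b_k of Eq(4.29) via Eq(4.30)."""
    p = len(a) - 1
    return [sum(comb(n, k) * (1 - alpha ** 2) ** k * (-alpha) ** (n - k) * a[n]
                for n in range(k, p + 1))
            for k in range(p + 1)]


def eval_allpass_form(a, alpha, z):
    """Evaluate sum_k a_k z~^{-k} with the all-pass substitution of Eq(4.22)."""
    zt_inv = (1 / z - alpha) / (1 - alpha / z)
    return sum(ak * zt_inv ** k for k, ak in enumerate(a))


def eval_lowpass_form(b, alpha, z):
    """Evaluate sum_k b_k [z^{-1} / (1 - alpha z^{-1})]^k."""
    d = (1 / z) / (1 - alpha / z)
    return sum(bk * d ** k for k, bk in enumerate(b))


a = [1.0, -0.9, 0.64, -0.2]          # arbitrary example coefficients
b = warped_to_direct(a, 0.47)
z = cmath.exp(0.3j)
mismatch = abs(eval_allpass_form(a, 0.47, z) - eval_lowpass_form(b, 0.47, z))
```

For a first-order filter the mapping reduces to $b_0 = a_0 - \alpha a_1$ and $b_1 = (1 - \alpha^2)\,a_1$, and the two evaluations agree to within rounding error at any frequency.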


4.5.3  Direct  Form  Filter 

It has been necessary to apply a linear transformation to the filter coefficients in the IIR
filter to maintain stability. In the same manner of analysis, we can restructure the all-pass
FIR of Figure 4.12 to a representation that is more suitable and consistent with the IIR filter.
Returning to the analysis filter of Figure 4.9, the same linear transformation of the
coefficients can be applied for the design of the modified FIR form. Figure 4.15 shows the
result of replacing the unit delay in Figure 4.9 with an all-pass, and transforming the $a_k$
coefficients by Eq(4.30). This is the modified FIR tapped delay line form, where modified
implies the conversion of the filter coefficients.

Figure 4.14: Modified synthesis filter element after coefficient transformation.

Figure 4.15: Modified analysis filter.

Nodal analysis demonstrates identical results with the all-pass form of Figure 4.9. The
filter structure is equivalent for the FIR form, where the $b_k$ coefficients represent the
transformation of the $a_k$ coefficients.

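$$
\begin{aligned}
X(z) &= \alpha Y(z) + S(z) \\
Y(z) &= X(z)z^{-1} \\
&= z^{-1}\,[\alpha Y(z) + S(z)] \\
\frac{Y(z)}{S(z)} &= \frac{z^{-1}}{1 - \alpha z^{-1}}
\end{aligned}
\tag{4.31}
$$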




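Eq(4.31) shows the filter delay is a function of the polynomial $z^{-1}/(1 - \alpha z^{-1})$ as noted in
Eq(4.29). The complete filter transfer function is described by

$$
\begin{aligned}
A(z) &= b_0 + b_1\,\frac{z^{-1}}{1 - \alpha z^{-1}} \\
E(z) &= A(z)S(z) \\
&= b_0 S(z) + b_1 Y(z) \\
&= S(z)\left[b_0 + b_1\,\frac{z^{-1}}{1 - \alpha z^{-1}}\right] \\
\frac{E(z)}{S(z)} &= b_0 + b_1\,\frac{z^{-1}}{1 - \alpha z^{-1}}
\end{aligned}
\tag{4.32}
$$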


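If the dotted box of Figure 4.15 is considered a delay element, the filter resembles the
standard FIR filter form with coefficients $b_k$. Replacing the $b_k$ coefficients with Eq(4.30),

$$
\begin{aligned}
\frac{E(z)}{S(z)} &= (a_0 - \alpha a_1) + (1 - \alpha^2)\,a_1\,\frac{z^{-1}}{1 - \alpha z^{-1}} \\
&= a_0 + \frac{-\alpha a_1 + \alpha^2 a_1 z^{-1} + a_1 z^{-1} - \alpha^2 a_1 z^{-1}}{1 - \alpha z^{-1}} \\
&= a_0 + a_1\,\frac{z^{-1} - \alpha}{1 - \alpha z^{-1}}
\end{aligned}
\tag{4.33}
$$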


we see the modified direct form filter of Eq(4.33) exactly matches the direct form filter of
Eq(4.26). Now both FIR (analysis) and IIR (synthesis) filters are in their modified direct
form and expressed as a polynomial in $z^{-1}/(1 - \alpha z^{-1})$. This is the acceptable form of a
WLPC vocoder as shown in Figure 4.16 since the synthesis filter can be constructed directly
with respect to the warped frequency scale.

Warped LPC filters were originally proposed for perceptual vocoder designs [130]. The
analysis filter on the left is on the transmitting end, and the synthesis filter on the right is
on the receiving end. The analysis stage performs the parameter extraction and speech
encoding. In general, the $b_k$ coefficients and other voicing parameters form the transmitted
prediction sequence. The encoded data passes through the transmission channel and is
decoded on the receiving end. The error signal $e(n)$ represents the excitation sequence
and is a function of the vocoder design. In CELP type coders, the error signal is not
sent and excitation sequences are generated at the receiving end as a function of the voicing
parameters and filter coefficients. At the synthesis stage, the speech parameters are decoded
and re-synthesized into speech. The filter structure of Figure 4.16 constitutes a complete
analysis/synthesis operation. The signal $\hat{s}(n)$ will be exactly reconstructed from $s(n)$ given
the error signal $e(n)$.

Figure 4.16: WLPC vocoder cited in [130].
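The exact reconstruction property can be illustrated with a small analysis/synthesis pair built from the first-order sections $z^{-1}/(1 - \alpha z^{-1})$ of Eq(4.31). This is a sketch under stated assumptions, not the structure of Figure 4.16 itself: the function names are hypothetical, the delay registers start at rest, and the $b_k$ values in the example were obtained from Eq(4.30) for an arbitrary second-order coefficient set.

```python
def _advance(prev, alpha):
    """Outputs y_1..y_p of the low-pass sections at time n from the state at n-1."""
    return [prev[k - 1] + alpha * prev[k] for k in range(1, len(prev))]


def modified_analysis(s, b, alpha):
    """e(n) = sum_k b_k y_k(n), with y_0(n) = s(n)."""
    prev = [0.0] * len(b)
    e = []
    for x in s:
        y = [x] + _advance(prev, alpha)
        e.append(sum(bk * yk for bk, yk in zip(b, y)))
        prev = y
    return e


def modified_synthesis(e, b, alpha):
    """Invert modified_analysis; every feedback term uses only past samples."""
    prev = [0.0] * len(b)
    s = []
    for x in e:
        tail = _advance(prev, alpha)          # depends only on time n-1
        out = (x - sum(bk * yk for bk, yk in zip(b[1:], tail))) / b[0]
        s.append(out)
        prev = [out] + tail
    return s


b = [1.51136, -0.994132, 0.242799]            # from a = [1, -0.9, 0.4], alpha = 0.47
signal = [((7 * n) % 11 - 5) / 5.0 for n in range(64)]
residual = modified_analysis(signal, b, 0.47)
rebuilt = modified_synthesis(residual, b, 0.47)
```

Since each section output at time $n$ is formed from values at time $n-1$, the synthesis loop contains no delay-free path, and the rebuilt signal matches the input to within rounding error.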


4.5.4  Warped  Bandwidth  Expansion 

Our design goal is to use the warping filter structure of Figure 4.16 as a speech enhancement
filter. We would like to use the LPC bandwidth expansion technique previously given
by Eq(4.10) in a warped filter structure. In the LPC form the bandwidth expansion filter
is represented as

$$H(z) = \frac{A(z)}{A(z/\gamma)} \tag{4.34}$$

where $\gamma$ represents a bandwidth broadening factor. This equation shows that the expansion
operates directly on the LPC coefficients $a_k$ given the notation of $A(z)$. The warped LPC
filter requires a linearly transformed coefficient set $b_k$ from the warped $a_k$ coefficients. In
this regard, the $b_k$ coefficients can be conceptually thought of as the $a_k$ coefficients evaluated
in a warped $z$ domain, $\tilde{z}$. In this case, a bandwidth expansion technique can be directly
applied to the warped coefficient set to achieve nonlinear bandwidth expansion. Eq(4.9)
demonstrated that an evaluation of the $z$-transform on a circle greater than the unit circle
resulted in a constant bandwidth expansion for all poles. All poles experience the same
bandwidth increase regardless of individual bandwidth. By applying this technique in
the warped domain, a constant bandwidth expansion in the warped domain translates to
nonlinear expansion in the linear domain. Eq(4.35) shows the expansion technique in the
warped domain.

$$H(\tilde{z}) = \frac{A(\tilde{z})}{A(\tilde{z}/\gamma)} \tag{4.35}$$


The expansion technique of Eq(4.34) is incorporated in the warped filter design of Eq(4.29).
Application of the linear expansion technique in the warped domain effectively bandwidth
broadens the poles on a nonlinear scale given by the warping factor $\alpha$. Selection of $\alpha$ using
Eq(4.25) bandwidth broadens the poles close to the critical band scale. In Appendix A we
show where the off-axis radius term $r = 1/\gamma$ is incorporated into the linear derivation of the
warped prediction coefficients. This results in the modification of the binomial equation

$$b_k = \sum_{n=k}^{p} C_{kn}\,a_n, \qquad \text{for } C_{kn} = \binom{n}{k}(1 - \alpha^2)^{k}(-\alpha)^{n-k}\,r^{-n}$$

or, as a recursion,

    b_p = a_p
    for n = 1...p {
        b_{p-n} = a_{p-n} - r^{-1} α b_{p-n+1}
        if n > 1 {
            for k = p-n+1...p-1 {
                b_k = r^{-1}(1 - α^2) b_k - r^{-1} α b_{k+1} } }
        b_p = r^{-1}(1 - α^2) b_p }
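The recursion can be cross-checked against the binomial form. The Python sketch below is an illustration under the assumption that the loop runs over $n = 1 \ldots p$ and that each pass multiplies the partial polynomial by $r^{-1}\tilde{z}^{-1}$ (a Horner-style evaluation); the function names and test values are hypothetical.

```python
from math import comb


def transform_recursive(a, alpha, r=1.0):
    """In-place style recursion for the b_k coefficients with evaluation radius r."""
    p = len(a) - 1
    b = list(a)                               # b_p starts as a_p
    for n in range(1, p + 1):
        b[p - n] = a[p - n] - alpha / r * b[p - n + 1]
        for k in range(p - n + 1, p):         # runs only when n > 1
            b[k] = (1 - alpha ** 2) / r * b[k] - alpha / r * b[k + 1]
        b[p] = (1 - alpha ** 2) / r * b[p]
    return b


def transform_binomial(a, alpha, r=1.0):
    """Direct evaluation of the modified binomial equation."""
    p = len(a) - 1
    return [sum(comb(n, k) * (1 - alpha ** 2) ** k * (-alpha) ** (n - k)
                * r ** (-n) * a[n] for n in range(k, p + 1))
            for k in range(p + 1)]


a = [1.0, -0.9, 0.64, -0.2, 0.05]             # arbitrary example coefficients
b_rec = transform_recursive(a, 0.4, r=1.02)
b_bin = transform_binomial(a, 0.4, r=1.02)
```

The two routes agree to within rounding error. The recursion needs only on the order of $p^2$ multiply-adds and no binomial table, which is the practical reason for preferring it in a filter update loop.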




4.5.5  Filter  Structure 

We have shown how to include the bandwidth broadening factor in the calculations of
the transformed coefficients. The IIR synthesis filter bandwidth broadens the excitation
sequence generated by the FIR filter stage. We can rearrange the filter of Figure 4.16 without
changing the overall system function. Since both of these systems are linear time-invariant
for a fixed window of speech (assuming initial rest conditions for the delay registers), the
order in which the two systems are cascaded can be reversed [103], as shown in Figure 4.17.
Effectively, we have interchanged the cascade order. The IIR filter is now on the left, and
the FIR is on the right. The direct realization of the filter transfer function is known as the
direct form I implementation.


Figure  4.17:  Changing  implementation  order  of  WLPC  vocoder  for  use  as  a WIIR  filter. 


Theoretically, the order of implementation does not affect the overall system function.
However, when a difference equation is implemented with finite-precision arithmetic, there
can be a significant difference between implementations. For this reason, cascade, parallel,
and lattice form filters are designed for robustness against the effects of coefficient
quantization. Such coefficient sensitivity typically renders the direct form filters of Figure 4.17
useless for implementations greater than second order. The only exception is in speech
processing, where systems of 10th order and higher are implemented in direct form [103].
This is possible because the poles of the system function are widely separated [114].


Figure 4.18: Formant bandwidth expansion filter with frequency scale set by locally
recurrent $\alpha$ parameter.

In Figure 4.17, we see the center delay chain stores the same values used in both
systems. This allows us to reduce the number of delays and collapse the two delay chains
together. This is referred to as the direct form II implementation or canonic direct form
implementation, since it uses the minimum number of delays. Figure 4.18 shows our
canonic direct form of the WLPC filter with critical band expansion. The $b_k$ coefficients
(on the left) are the bandwidth expanded terms in the IIR structure ($r > 1$). This is the
general form of the warped bandwidth expansion filter used to adjust the formant poles on
a critical band scale.

Figure 4.19 shows the family of bandwidth expansion curves given a particular sampling
frequency and evaluation radius. This figure characterizes the warped bandwidth filter. In
this figure $f_s = 8$ kHz, and the evaluation radius is $r = 1.02$. The $\alpha$ values specify the level
of bandwidth expansion or compression. For $\alpha \neq 0$ the intersection of each curve with the
$\alpha = 0$ curve sets the crossover frequency. The crossover frequency corresponds to the
frequency at which the derivative given in Eq(4.24) equals one, where the warping neither
expands nor compresses the frequency axis. It can be seen from Figure 4.19 that
at $\alpha = 0$ there is uniform bandwidth expansion across all frequencies and the bandwidth
corresponds to $B = \ln(r)\,f_s/\pi$, which is $B \approx 50$ Hz for $f_s = 8$ kHz, $r = 1.02$, and $\alpha = 0$.
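The uniform expansion at $\alpha = 0$ can be reproduced with a single resonance. The sketch below is illustrative and uses assumed names; it scales the coefficients of a second-order section by $r^{-k}$, which is equivalent to evaluating the polynomial on a circle of radius $r$, and reads the bandwidth back from the pole radius using the relation $B = -\ln(\rho)\,f_s/\pi$ for a pole of radius $\rho$.

```python
import cmath
import math


def bandwidth_hz(radius, fs_hz):
    """Bandwidth implied by a pole radius below one."""
    return -math.log(radius) * fs_hz / math.pi


def scale_coefficients(a, r):
    """Evaluate A(z) on a circle of radius r: a_k -> a_k * r**(-k)."""
    return [ak * r ** (-k) for k, ak in enumerate(a)]


def quadratic_poles(a):
    """Roots of z^2 + a[1] z + a[2] for a section with a[0] = 1."""
    disc = cmath.sqrt(a[1] ** 2 - 4.0 * a[2])
    return (-a[1] + disc) / 2.0, (-a[1] - disc) / 2.0


fs = 8000.0
rho = math.exp(-math.pi * 50.0 / fs)          # 50 Hz resonance bandwidth
theta = 2.0 * math.pi * 1000.0 / fs           # 1 kHz center frequency
section = [1.0, -2.0 * rho * math.cos(theta), rho ** 2]
expanded = scale_coefficients(section, 1.02)
new_bandwidth = bandwidth_hz(abs(quadratic_poles(expanded)[0]), fs)
```

The bandwidth of the scaled section equals the original 50 Hz plus $\ln(1.02)\,f_s/\pi$, independent of the center frequency chosen for the resonance.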


Figure  4.19:  Family  of  curves  showing  frequency  dependent  bandwidth  expansion  for  a 
particular  evaluation  radius  from  the  warped  filter. 

The change in bandwidth is specified by the evaluation radius, sampling frequency,
and $\alpha$ values. The bandwidth expansion is constant in the warped domain and given by
Eq(4.9). The constant warped bandwidth is evaluated in the linear domain to determine
the linear bandwidth expansion. Eqs(2.2) and (2.3) are analytic expressions that relate the
critical band rate to frequency in Hertz. The linear bandwidth change is evaluated from
the warped bandwidth using the all-pass inversion. As was noted in Chapter 2, Zwicker's
formula [150] does not have a numerically invertible expression. Traunmuller [134] provided
an invertible equation given by Eq(2.3). However, this is a piecewise approximation which
does not have a smooth derivative function and generates a frequency shift in the analysis
at the boundaries of the piecewise sections. The all-pass approximation with negative $\alpha$
generates a continuous inversion, and was used to calculate the bandwidth changes in Figure
4.19.

Figure 4.20 shows the warped filter gain of Eq(4.35) as a function of frequency. A unit
amplitude sinusoidal chirp signal from 1 Hz to 4 kHz is swept through the warped filter of
Figure 4.18 to characterize the frequency gain. The curves are normalized such that at
$\alpha = 0$ unity gain is achieved. This truly models the transfer characteristics of the filter.
A higher than normal filter order (26th, compared to 6th to 12th order WLPC for speech) is
required to model the sharp poles of the sinusoidal chirp and evaluate their individual
gains. It should be evident from the figure that the warped bandwidth expansion filter
induces a nonlinear frequency gain. The intent of the warped filter design was to adjust
only the formant bandwidths on the nonlinear frequency scale given by $\alpha$, not the formant gains.
The filter response of Eq(4.35) in Figure 4.20 shows an inherent frequency dependent gain.




Figure 4.20: Warped filter output gain curves (normalized for unity at $\alpha = 0$) for a sinusoidal
chirp signal on an evaluation radius of 1.02.

In linear prediction, minimization of the mean squared error implies maximization of
the prediction gain [85]. Minimization of the mean squared error results in maximal spectral
flatness of the prediction error. In essence, by solving the normal set of LPC equations we
guarantee minimization of the mean square error by a whitening transform. In warped linear
prediction, Strube [130] shows that spectral flatness and minimum power of the error are no
longer equivalent criteria. The error signal $e(t)$ is generated by filtering the speech signal
$s(t)$ with the warped linear prediction coefficients $a_k$. These coefficients were determined by
solving the normal equations for the warped autocorrelation function $R(\tilde{\omega})$, giving
$E(\tilde{\omega}) = |A(e^{j\tilde{\omega}})|^2\,|S(e^{j\tilde{\omega}})|^2$. This results in a prediction error power

$$
\begin{aligned}
\sigma^2 &= \int_{-\pi}^{\pi} E(\tilde{\omega})\,d\tilde{\omega} \\
&= \int_{-\pi}^{\pi} E(\omega)\,\frac{1 - \alpha^2}{1 + \alpha^2 - 2\alpha\cos\omega}\,d\omega
\end{aligned}
\tag{4.36}
$$


with the substitution of $d\tilde{\omega}$ from Eq(4.24). This implies that the minimization is over a
weighted error, i.e., the error signal filtered by $W(z) = \sqrt{1 - \alpha^2}/(1 - \alpha z^{-1})$. If we were
to make a comparison between spectral flatness measures in warped and linear prediction
it would be necessary to account for this low-pass weighting [48]. The low-pass effect of
Eq(4.36) creates the frequency gain dependence seen in Figure 4.20. We can alleviate this
effect and compensate the gain curves by applying a high-pass filter of the form

$$H_{hp}(z) = \frac{1 - \alpha z^{-1}}{\sqrt{1 - \alpha^2}} \tag{4.37}$$
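The relation between the error weighting and the warping derivative can be confirmed numerically. In the sketch below, which uses assumed function names, the squared magnitude of the weighting filter $W(z)$ on the unit circle is compared with Eq(4.24), and the compensation filter of Eq(4.37) is checked to be its exact inverse.

```python
import cmath
import math


def weighting(omega, alpha):
    """W(z) = sqrt(1 - alpha^2) / (1 - alpha z^{-1}) at z = exp(j*omega)."""
    return math.sqrt(1.0 - alpha ** 2) / (1.0 - alpha * cmath.exp(-1j * omega))


def compensation(omega, alpha):
    """High-pass compensation filter of Eq(4.37) at z = exp(j*omega)."""
    return (1.0 - alpha * cmath.exp(-1j * omega)) / math.sqrt(1.0 - alpha ** 2)


def warp_slope(omega, alpha):
    """Warping derivative of Eq(4.24)."""
    return (1.0 - alpha ** 2) / (1.0 + alpha ** 2 - 2.0 * alpha * math.cos(omega))


omega, alpha = 0.8, 0.47
weight_power = abs(weighting(omega, alpha)) ** 2
cascade = weighting(omega, alpha) * compensation(omega, alpha)
```

For positive $\alpha$ the weighting is largest at low frequencies and smallest at high frequencies, so the compensation filter restores the balance by boosting the upper part of the band.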

Figure 4.21 shows the gain compensation curves generated by Eq(4.37) for positive and
negative warping parameters $\alpha$. Eq(4.34) represents the transfer function of the warped
bandwidth expansion filter to be complemented with the filter of Eq(4.37). It should be
noted that the inverse of the all-pole model in the numerator generates the true residual
(error) signal. This signal is then effectively filtered by the bandwidth expanded model in
the denominator. This implies a complete re-synthesis of the speech signal. A preferred
approach is to shape the spectrum from a bandwidth expanded version of the all-pole model
without complete re-synthesis. This is a technique used in the vocoder post-filter design of
section 4.3.2 [15]. The bandwidth expansion technique is applied to the numerator to
attenuate formant peaks in relation to formant sidelobes. For $0 < \beta < \gamma < 1$, the warped
post-filter of Eq(4.38) performs the bandwidth expansion by spectral shaping without
complete re-synthesis. We can represent the warped filter with the gain compensation filter as,

(4.38) 

where $\gamma = 0.8$ and $\beta = 0.4$ are suitable expansion factors for loudness enhancement with a
10th-order all-pole model.




Figure  4.21:  Family  of  gain  compensation  curves  given  by  Eq(4.37). 

In Figures 4.22 and 4.23, we demonstrate the effect of the warping parameter on a
synthetic vowel segment. Eq(4.9) demonstrated that the linear bandwidth broadening
technique provides only a constant bandwidth change as the evaluation is moved off the unit
circle. The warping factor $\alpha$ provides an additional degree of freedom to adjust the
frequency scale. In Figure 4.22 we see that an evaluation off the unit circle with $r = 1.01$
provides an equal change in pole bandwidths, which is about 37-39 Hz. The original
formant bandwidths, shown as the dotted lines, are all 50 Hz. The synthetic vowel excitation
signal (residual) is synthesized with the all-pole model $1/A(z/\beta)$, where $\beta$ is the bandwidth
broadening factor. Evaluation off the unit circle increases all the bandwidths by roughly 30 Hz.

In Figure 4.23 we include a warping factor of 0.34 to alter the frequency resolution
for illustration. The synthetic vowel excitation signal (residual) is synthesized with the
warped all-pole model $1/A(\tilde{z}/\beta)$, where $\alpha$ is the warping factor on the right-hand slider of
the GUI that sets the frequency scale. In this case the change in pole bandwidths is no
longer constant. For a positive warping factor the higher formant bandwidths will increase
in greater proportion than the lower formant bandwidths. For a negative warping factor
the lower formant bandwidths will increase in greater proportion than the higher formant
bandwidths.




[Figure 4.22 graphic: GUI panels "Vowel time response" and "Frequency", with LPF, HPF, and Warp sliders and a displayed value B=89.]

Figure 4.22: Spectral envelope of a synthetic vowel for warped bandwidth expansion
$1/A(\tilde{z}/\beta)$ (solid) and original $1/A(z)$ (dotted). Demonstrates that an evaluation off the unit
circle with a warping factor $\alpha = 0$ results in a uniform bandwidth change for all formants;
a) time response, b) frequency response, and c) spectral envelope. One slider is used to
set the warping factor $-0.6 < \alpha < 0.6$. Another is used to set the evaluation radius, and a
third slider allows first-order low-pass or high-pass filtering $1 \pm \mu z^{-1}$ to adjust for spectral
tilt. Loudness levels using ISO-532B are given for the original and processed vowel, and
original formant bandwidths (dotted) are all 50 Hz.




Figure 4.23: Spectral envelope of a synthetic vowel for warped bandwidth expansion
$1/A(\tilde{z}/\beta)$ (solid) and original $1/A(z)$ (dotted). Demonstrates that an evaluation off the unit
circle with a warping factor $\alpha = 0.34$ results in a non-uniform bandwidth change for all
formants; a) time response, b) frequency response, and c) spectral envelope. One slider is used
to set the warping factor $-0.6 < \alpha < 0.6$. Another is used to set the evaluation radius, and
a third slider allows first-order low-pass or high-pass filtering $1 \pm \mu z^{-1}$ to adjust for spectral
tilt. Loudness levels using ISO-532B are given for the original and processed vowel, and
original formant bandwidths (dotted) are all 50 Hz.




4.6  Auditory  Modelling 


When $\alpha > 0$ the warping transform places more poles in the lower frequency regions. In
comparison to LPC with the same number of poles, the warped response will have a better
fit to the data in the low frequency regions. When $\alpha < 0$ more poles are placed in higher
frequency regions. One property of a minimum-phase system, such as $A(z)$, is that the log
spectrum is zero mean. If all the zeros are inside the unit circle, the inverse will be analytic
on and within the unit circle.


$$\frac{1}{2\pi}\int_{-\pi}^{\pi} \ln|A(e^{j\omega})|^2\,d\omega = 0 \tag{4.39}$$


The proof of this property is given in [85] and is shown through residue calculus to be

$$\int_{-\pi}^{\pi} \ln[A(e^{j\omega})]\,d\omega = 2\pi\{\ln[A(\infty)]\} = 2\pi[\ln(1)] = 0 \tag{4.40}$$


Thus, the log spectrum has equal area above and below the 0 dB reference line. This
property also holds for a warped filter $A(\tilde{z})$ if the integration is performed on the warped
frequency axis with respect to $\tilde{\omega}$. However, the property does not hold if the warped
frequency response is evaluated on the linear frequency axis $\omega$ [48]. A gain term can be
seen in the filter if the unit delay $z^{-1}$ is replaced with the all-pass delay,

$$z^{-1} \rightarrow \frac{z^{-1} - \alpha}{1 - \alpha z^{-1}} \tag{4.41}$$

This results in a gain term [48]

$$g = A(-\alpha) = 1 + \sum_{k=1}^{p} a_k(-\alpha)^{k}, \qquad a_0 = 1 \tag{4.42}$$
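The zero-mean property and the gain term can be illustrated by averaging the log spectrum over uniformly spaced frequencies. This Python sketch is a numerical illustration under assumptions: the function names are hypothetical, the coefficient set is an arbitrary minimum-phase example, and the gain is written in the convention of Eq(4.27), where $A(\tilde{z}) = \sum_k a_k \tilde{z}^{-k}$ with $a_0 = 1$.

```python
import cmath
import math


def mean_log_spectrum(a, alpha, n_points=2048):
    """Average of ln|A|^2 over the linear frequency axis, with all-pass delays."""
    total = 0.0
    for m in range(n_points):
        z_inv = cmath.exp(-2j * math.pi * m / n_points)
        zt_inv = (z_inv - alpha) / (1.0 - alpha * z_inv)
        value = sum(ak * zt_inv ** k for k, ak in enumerate(a))
        total += math.log(abs(value) ** 2)
    return total / n_points


def gain_term(a, alpha):
    """Value of the warped polynomial at z -> infinity, where z~^{-1} = -alpha."""
    return sum(ak * (-alpha) ** k for k, ak in enumerate(a))


a = [1.0, -0.9, 0.4]                      # zeros at radius sqrt(0.4), inside the unit circle
unwarped_mean = mean_log_spectrum(a, 0.0)
warped_mean = mean_log_spectrum(a, 0.47)
g = gain_term(a, 0.47)
adjusted_mean = mean_log_spectrum([ak / g for ak in a], 0.47)
```

Without warping the mean is zero. With warping, the mean on the linear axis equals $\ln g^2$, and dividing the coefficients by $g$ restores the zero-mean property, which is the gain adjustment applied to the WLPC response.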

Figure  4.24  shows  a WLPC  response  on  the  linear  frequency  axis  after  gain  adjustment. 
Prior  to  gain  adjustment  the  WLPC  spectrum  was  slightly  lower.  The  zero  mean  property 
of  the  log  spectrum  is  satisfied. 

Figure 4.24: WLPC gain adjustment.

Warped filters have been successfully applied to auditory modelling and audio equalization
designs [4, 73]. Figure 4.25 shows a 32nd-order LPC and WLPC model response for
a synthetic vowel /a/ at $f_s = 8$ kHz on a linear axis, and with $\alpha = 0.47$ on a warped axis
approximating the critical band scale. The warped filter design of section 4.4 was used to
generate the all-pole model $\sigma^2/|A(z)|^2$, where $\sigma^2$ is the residual error defined as

$$\sigma^2 = a^{T} R\,a = |R_p| / |R_{p-1}|$$

where $a$ are the warped prediction coefficients from Eq(4.19) and $R$ is the Toeplitz matrix
of the warped autocorrelation sequence from Eq(4.17). A property of the all-pole model is
that the calculation of $\sigma^2$ can also be carried out by the recursive minimization relationship
using the determinant of the autocorrelation matrices, shown in the second equality. The
WLPC  model  of  Figure  4.25  effectively  places  more  poles  in  the  low  frequency  regions  due 
to  the  warped  frequency  scale,  and  thus  shows  pronounced  emphasis  where  the  poles  have 
migrated.  A higher  than  normal  order  is  used  to  demonstrate  the  differences.  The  same 
order  WLPC  model  clearly  discriminates  more  of  the  low  frequency  peaks  than  the  linear 
model.  The  WLPC  analysis  demonstrates  that  we  can  achieve  a better  fit  to  the  auditory 
spectrum  with  a lower  order  filter  compared  to  LPC.  In  this  example  we  do  not  want  to  use 
a model  order  high  enough  to  resolve  the  pitch  harmonics.  We  want  to  keep  the  excitation 
and  the  vocal  envelope  separate,  but  the  example  shows  us  the  modelling  accuracy  of  WLPC 
for  the  auditory  spectrum. 

Figure 4.25: Model of the synthetic vowel /a/ with LPC and WLPC envelope on linear
frequency scale (top) and warped scale (bottom).

We have shown how WLPC is used for auditory modelling, and how the warped filter of
Figure 4.18 achieves nonlinear bandwidth expansion. We have also shown the relation and
interdependency of pole bandwidth and filter gain as a function of frequency. The premise
of the warped design was to expand formant bandwidths on a critical band scale. We have
shown that loudness increases when a critical band is exceeded, and the motivation behind
the warped filter design was to increase loudness by expanding bandwidths on a critical
band scale. The warping parameter $\alpha$ defines the nonlinear frequency scale. It would also
be of interest to know if another method or technique could be employed. We can use the
bandwidth values of Figure 4.19 in Eq(4.9) to solve for the radius gains in the linear domain
for a particular evaluation radius. Figure 4.26 shows the pole scaling factor necessary to
achieve the bandwidth changes in Figure 4.19. As can be seen, an evaluation radius of 1.02
corresponds to a scaling of $1/1.02 = 0.98$ at $\alpha = 0$. The radius gains can be used to scale the
speech poles to achieve the same bandwidth expansion effect as the warped filter. At first
glance this may seem a much simpler solution than the warped filter design. However, the
entire intention of the warped filter was to utilize the power series scaling pole displacement
model of Eq(4.7) in the warped frequency domain. This is a practical technique with LPC
vocoders, since efficient algorithms and lattice filters have been developed to solve for, and
implement, the linear prediction coefficients. Constant scaling of the prediction coefficients


105 


Figure  4.26:  Pole  radii  scaling  necessary  to  achieve  bandwidth  effects  of  Fig(4.19). 

by  a power  series  is  extremely  powerful  and  efficient.  For  the  former  method,  a root  finding 
technique  is  required,  and  then  the  poles  can  be  individually  scaled  to  achieve  the  same 
bandwidth  changes.  However,  root  extraction  algorithms  are  computationally  expensive, 
and  the  warped  filter  retains  the  computational  efficiency  of  the  LPC  methods.  The  warped 
filter  essentially  performs  a similar  function  as  the  combination  of  root  extraction  and 
individual  scaling,  but  as  a more  implementable  DSP  approach. 
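The equivalence between power series scaling of the coefficients and uniform scaling of the pole
radii can be illustrated in the linear (unwarped) domain. The sketch below is only an illustration:
the 4th-order polynomial, the sampling rate, and the function names are assumptions, and the
bandwidth formula is the usual single-pole approximation B = -(fs/pi) ln r, which may differ in
form from Eq(4.9).

```python
import numpy as np

def expand_bandwidth(a, gamma):
    """Scale coefficient a_k by gamma**k, which evaluates A(z / gamma)."""
    a = np.asarray(a, dtype=float)
    return a * gamma ** np.arange(len(a))

def bandwidth_hz(pole, fs):
    """Common 3 dB bandwidth approximation for a pole of radius |pole|."""
    return -fs / np.pi * np.log(np.abs(pole))

# Hypothetical all-pole denominator built from two resonances (assumed values).
poles = np.array([0.97 * np.exp(1j * 0.3), 0.95 * np.exp(1j * 1.2)])
poles = np.concatenate([poles, poles.conj()])
a = np.real(np.poly(poles))          # a = [1, a1, a2, a3, a4]

radius = 1.02                        # evaluation radius used in the text
gamma = 1.0 / radius                 # pole scaling factor, about 0.98

# Root extraction is used here only to verify the result; the coefficient
# scaling itself needs no root finding.
scaled_poles = np.roots(expand_bandwidth(a, gamma))

fs = 8000.0
print(np.sort(bandwidth_hz(poles, fs)))
print(np.sort(bandwidth_hz(scaled_poles, fs)))
```

Every pole keeps its angle and has its radius multiplied by the same factor, so each bandwidth
grows by the same amount in Hz. Achieving a frequency dependent expansion in this domain
would require individual root scaling, which is what the warped design avoids.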

4.7  The  Gamma  Filter 

At  this  point,  it  is  of  interest  to  note  that  the  warped  filter  implementation  is  the 
application  of  a linear  filtering  technique  to  a transformed  signal  space.  This  suggests  a 
way  of  examining  the  signal  representation  through  the  analysis  of  basis  functions.  The 
topic  of  basis  functions  is  important  in  signal  processing  when  we  discuss  modelling  power 
and approximation accuracy. We are familiar with FIR and IIR filters as structures for
signal modelling, in regards to filter order and memory depth. In the discrete-time $z$-domain
the basis functions are the complex exponentials associated with the Fourier series
expansion. In the filter of Figure 4.18 the basis functions are the Laguerre kernels. The
representation of the signal space is essentially the projection of the original signal onto
these basis functions. We should recall that such a representation is a decomposition
technique, and attempts to represent signals in terms of superposition, or linear sums of
basis functions. The resolving power of a decomposition is determined by the form and
number of the basis functions in regards to the signal characteristics. Any signal can
be represented by an infinite set of basis functions [13]. The representation becomes an
approximation problem when only a finite set of basis functions is available, and the
optimization develops when the bases result in the maximum information transfer.

The local feedback parameter $\alpha$ of Figure 4.18 injects time dispersion which, as we have
shown, allows for the alteration of the frequency scale. It is interesting to note that such an
approach may also be considered a way of altering the projection space. In the $z$-domain the
projection operators, or basis functions, are the complex exponentials. The term operator
is used to denote the implementation of a delay element $z^{-1}$ [63]. The $\gamma$-transformation is
another such operator which provides a mathematical framework to describe gamma filters
as conventional filters in the $\gamma$-domain [113]. The $\gamma$-domain describes a generalized
feed-forward filter defined around delay operators $\gamma^{-1}$, given by

$$\gamma^{-1} = G(z) \qquad (4.43)$$

The gamma filter is a unique filter implementation which combines the desirable properties
of both FIR and IIR filters [20]. The local feedback loop establishes IIR properties, and
provides the modelling power for non-stationary signals with long sustenance in time. The
FIR properties retain the implementation stability and ease of adaptation. The gamma
filter is a generalized feed-forward filter and differs from other similar architectures such
as Laguerre filters, in that it is utilized as a short term memory structure for temporal
processing in neural networks [19]. The gamma filter is characterized by a local feedback
parameter that maps the basis vectors onto a new signal space. The feedback is locally
recurrent and allows a conformal mapping, which by definition preserves the angular
relationship between the basis vectors [20]. The gamma bases can be implemented in the form
of a cascade of identical low pass filters, which is structurally similar to that of Figure 4.18
for the FIR form. The discrete gamma delay operator evaluates to

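$$G(z) = \frac{\mu}{z - (1 - \mu)} \qquad (4.44)$$

and replaces the unit-delays of a conventional filter. This defines the $\gamma$-filter, which is
described by a family of gamma bases

$$g_n(\lambda t) = \frac{\lambda^n}{(n-1)!}\, t^{n-1} e^{-\lambda t}, \qquad \lambda > 0,\ t > 0,\ \lambda \text{ real},\quad n = 1, 2, 3, \ldots \qquad (4.45)$$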


as shown in Figure 4.27 for $\lambda = 1$. These are also referred to as the gamma kernels. The
generalized feed-forward $\gamma$-filter provides a projection space that has been shown to have
better modelling capabilities than the projection space provided by adaptive FIR filters
[63].


Figure  4.27:  Gamma  bases  given  by  Eq(4.45). 
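The gamma kernels of Eq(4.45) are straightforward to evaluate. The sketch below assumes the
standard form $\lambda^n t^{n-1} e^{-\lambda t}/(n-1)!$ and uses a hypothetical time grid; it
computes the area under each kernel with a trapezoidal sum and locates each peak.

```python
import math
import numpy as np

def gamma_kernel(n, t, lam=1.0):
    """n-th gamma kernel: lam**n * t**(n-1) * exp(-lam*t) / (n-1)!"""
    return lam ** n * t ** (n - 1) * np.exp(-lam * t) / math.factorial(n - 1)

# Assumed time grid, long enough for the first few kernels to decay.
t = np.linspace(0.0, 60.0, 60001)
dt = t[1] - t[0]

areas, peaks = [], []
for n in range(1, 6):
    g = gamma_kernel(n, t)
    areas.append(float(np.sum(0.5 * (g[1:] + g[:-1])) * dt))  # trapezoidal area
    peaks.append(float(t[np.argmax(g)]))                      # location of the maximum

print(areas)
print(peaks)
```

Each kernel integrates to approximately one, and the peak of the n-th kernel sits near
$t = (n-1)/\lambda$, so higher order kernels reach further back in time while keeping the same area.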


The gamma bases are linearly independent though not orthogonal [14]. The gamma bases
constitute a subset of eigenfunctions, or auto-regressive moving average systems, and are
therefore able to capture well the dynamics of a signal which accepts the linear model [13].
They are closely related to the Laguerre polynomials, in that the Laguerre polynomials
can be derived from the gamma bases by applying the Gram-Schmidt orthogonalization
procedure [14]. In addition, the gamma bases are a family of real exponentials as noted by
Eq(4.45). It would be encouraging to relate the warped filter of Eq(4.38) to the gamma
domain since a basis of real exponentials is biologically plausible, whereas the realization
of complex exponentials is not. This interpretation suggests a way of looking at real basis
functions in relation to biological principles and their functional implementation. However,
it is not straightforward to mathematically relate the warped filter representation with the
gamma bases. The warped filter poles are not one to one with the poles in the gamma
domain. The basis functions in the warped filter are the Laguerre kernels, which were
linearly transformed to compensate for an unrealizable time dependency.

The local feedback parameter $(1 - \mu)$ of the $\gamma$-filter provides a one to one mapping and
only affects the space spanned by the basis vectors [13]. It is a free parameter which controls
the region of support of each basis in time. This parameter uncouples the memory depth
from the number of filter weights to improve the modelling accuracy for low pass, or long
impulse response, signals. As we noted earlier, the warped filter is able to distribute its
poles in accordance with the frequency scale set by $\alpha$ in the all-pass elements. The feedback
term improves the modelling approximation by a rotation of the signal space. The most
attractive quality of this single feedback parameter is that it can set the region of support.
This is the motivation behind the gamma transformation, which attempts to align the bases
parallel to the signal space [13]. This is the optimal separation a projection operator seeks.
The linear expansion technique applied to the warped signal space resembles an operation
available in the gamma space. In addition, the feedback factor $\alpha$ of the warped filter sets the
scale to the resolution of human hearing. In the gamma domain, generalized feed-forward
filters are ordinary FIR filters defined around a delay operator $\gamma^{-1}$. This allows standard
FIR filter design and analysis techniques to be applied in the gamma domain. For this
reason, the bandwidth expansion technique of the warped filter portrays the motivation of
the gamma filter and can be examined for similarity.

The power of the $\gamma$-filter for temporal modelling lies in the locally recurrent feedback
parameter $(1 - \mu)$. Let us consider approximating an unknown linear system with rational
transfer function $H(z)$ with a causal FIR filter by expanding the transfer function in a Laurent
series about the point $z = 0$,

$$H(z) = \sum_{k=0}^{N} w_k z^{-k} \qquad (4.46)$$

This is a power series, and is effectively an expansion about a pole at the origin [112].
According to the theory of complex functions, if the expansion is conducted about any
point which belongs to the region of convergence in the $z^{-1}$ domain, the results should be
identical (cited in [112]). If the expansion is considered about the real pole at $a$ where
$|a| < 1$ we have a cascade of identical first order IIR filters called the gamma filter,

$$H(z) = \sum_{k=0}^{N} w_k (z - a)^{-k} \qquad (4.47)$$

Each stage in the $\gamma$-filter is essentially a low pass filter, for which stability is guaranteed
if $0 < \mu < 2$ [112]. In the gamma domain, we can imagine the new representation of the
signal as the projection of the signal on the basis functions defined by the gamma kernels.
The gamma function is equivalent to the Laurent series expansion of a signal $\mu^{-n}x(n)$
evaluated at a new origin $\gamma_0 = (\mu - 1)/\mu$ [113]. The idea is displayed in Figure 4.28 where
$a = (1 - \mu)$ is the new origin in Eq(4.47). This allows us to visually see the projection of
the unit circle onto the gamma domain (inner circle) described by Eq(4.47). The angles in
the gamma space are projected back onto the unit circle (as open dots) to demonstrate the
nonlinear placement of the poles. However, they cannot be directly aligned with the critical
band scale given by the Laguerre kernels. The poles were previously distributed in uniform
increments. The additional gain term $\mu$ in the $\gamma$-filter is included to
normalize the kernel area and make the filter unity gain. The gamma bases were inspired
by a probability distribution such that the area under each basis is equal to unity [13]. This
is advantageous for a filter structure whose stability is of concern due to dynamic range. As
we noted earlier, certain WIIR configurations were proposed to retain stability for higher
order analysis, particularly for auditory modelling. The low pass filter structure of Strube
used in the warped bandwidth expansion filter of Figure 4.18 is stable for orders below
20-25. Above this, the filter tap values of the higher chains grow excessively large. The
filter structures proposed by Steiglitz and Karjalainen shown in Figure 4.29 alleviate this
problem by using all-pass sections of unity magnitude [127, 74].

Figure 4.28: Relation between the $z$ and $\gamma$ domains.

Figure 4.29: Stable higher order WIIR filters (stable high-order WIIR all-pole structure).

In a similar fashion, the $\gamma$-filters may be suitable to control the precision requirements
since they are also unity gain. If we consider replacing the low pass sections in the center
delay chain of Figure 4.18 we can constrain the tap values to a stable dynamic range. Figure
4.30 shows the DC gains of the low pass and $\gamma$-filter sections for placement in the center
tap line. We can replace the low pass sections of Figure 4.18 with the unity gain gamma
filters. It should be noted that the centerline tap values are scaled by a power series of
$\mu$ to account for the added $\mu$ gain scale. The scaling can be included in the coefficients.
The filter has only been suggested for use at this time to limit the tap gain dynamic range.
Figure 4.31 shows the tap delay values for both the low pass sections and gamma sections
of a 20th order filter. As we can see, the unity gain of the gamma filters should provide a
more reasonable dynamic range for higher orders since the tap outputs are well behaved.

Figure 4.30: Substitution of locally recurrent feedback loop with gamma kernel. The gamma
section is annotated with a DC gain of 1.


Figure 4.31: Twenty center filter tap magnitude values, plotted against tap number, for
a) low-pass element of Fig. 4.29 and b) $\gamma$ element of Fig. 4.30.
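The growth of the tap values along the center chain can be reproduced with a small simulation.
The sketch below rests on assumptions: each low pass section is taken to be a one-pole recursion
with unit input gain and feedback $\alpha$, the gamma section uses input gain $\mu$ and feedback
$(1 - \mu)$ with $\mu = 1 - \alpha$, and the value $\alpha = 0.47$ and the function names are chosen
only for illustration. The DC gain at each tap is measured as the settled step response.

```python
import numpy as np

def one_pole(x, b0, feedback):
    """One section: y[n] = b0 * x[n-1] + feedback * y[n-1]."""
    y = np.zeros_like(x)
    for n in range(1, len(x)):
        y[n] = b0 * x[n - 1] + feedback * y[n - 1]
    return y

def tap_dc_gains(b0, feedback, order, n=400):
    """Settled step response at each tap of a cascade of identical sections."""
    sig = np.ones(n)
    gains = []
    for _ in range(order):
        sig = one_pole(sig, b0, feedback)
        gains.append(sig[-1])
    return np.array(gains)

alpha = 0.47            # assumed feedback of the low pass sections
mu = 1.0 - alpha        # gamma parameter giving the same feedback (1 - mu)

lowpass_taps = tap_dc_gains(1.0, alpha, order=20)
gamma_taps = tap_dc_gains(mu, 1.0 - mu, order=20)

print(lowpass_taps[[0, 9, 19]])   # grows geometrically along the chain
print(gamma_taps[[0, 9, 19]])     # stays near unity
```

Under these assumptions the k-th low pass tap has DC gain $(1/(1-\alpha))^k$, while the k-th
gamma tap equals that value multiplied by $\mu^k$, which is the power series scaling that can be
absorbed into the coefficients.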


CHAPTER  5 

OBJECTIVE  EVALUATIONS 


In  the  previous  chapters,  we  have  shown  that  vowel  loudness  monotonically  increases 
as  formant  bandwidth  increases,  and  that  loudness  saturates  when  the  resonant  structure 
dissolves  to  a flat  spectrum.  We  have  also  presented  studies  which  reveal  that  formant 
bandwidths  in  vowels  can  be  expanded  to  an  extent  without  severely  affecting  intelligibility. 
The  objective  of  this  research  has  been  to  enhance  the  loudness  of  speech  without  increasing 
speech  signal  energy  while  minimizing  intelligibility  loss.  In  this  chapter  we  provide  ISO- 
532B  analysis  results  using  the  filter  designs  of  chapter  4 as  an  objective  evaluation  of  the 
speech  processing  method.  The  ISO-532B  sone  level  ratio  describes  the  effective  increase  in 
loudness of speech processed by the warped filter for various values of the warping parameter
$\alpha$. We provide simulations to show that the optimal warping factor for increasing loudness,
at  equal  signal  energy,  corresponds  to  the  critical  band  scale.  Results  for  the  effective 
increase  in  loudness  are  presented  for  all  phoneme  distributions  of  the  1,681  sentences  of 
the  TIMIT  test  dataset.  The  voiced  regions  corresponding  to  the  vowels,  nasals,  and  glides, 
selected  as  candidates  for  formant  expansion,  will  be  seen  to  show  a more  prominent  increase 
in  loudness  in  comparison  to  the  unvoiced  regions. 

In this chapter we also propose a gain function that relates the spectral energy distribution
to the effective loudness gain. Results are provided to verify the effective loudness gain
with the true ISO-532B analysis and loudness approximation presented in Chapter 2. The
approximation errors are presented for all phoneme regions of the TIMIT test set, and results
indicate that the gain function is a close approximation to the expected loudness gain
of the warped filter. This gain tells us the scaling we can apply to elevate the original signal
to the same loudness as the processed signal. We also extend this gain function to an all-pole
loudness ratio and relate the gain in loudness to the evaluation radius. A loudness distortion
measure is also proposed as an illustration to simplify the gain function. Perceived loudness
is approximately logarithmic, and the log spectral distance appears to be closely related to
the subjective assessment of sound differences [114]. The cepstral distortion measure is
presented to relate the specific loudness distortion, and is intimately related to the log spectral
distortion. The inclusion of a spectral distortion measure within the loudness ratio reduces
the gain to a function of the all-pole model and bandwidth expansion factor. We complete
this chapter with an objective evaluation of the loudness enhancement filter as a front end
process to machine speech recognition. Both Dynamic Time Warping (DTW)
and Hidden Markov Model (HMM) systems are used to evaluate recognition ratings.

5.1  ISO-532B  Analysis 

There  are  no  direct  standards  to  date  for  measuring  non-stationary  loudness.  The  ISO- 
532B  measure  discussed  in  detail  in  Chapter  2,  however,  provides  an  analytic  evaluation 
of loudness. The ISO-532B incorporates a psychoacoustic model of hearing and a representation
of loudness as a measurable value. An algorithmically modified word or sentence
can  be  passed  through  the  ISO-532B,  and  a loudness  time  distribution  will  be  generated. 
The  loudness  plot  presents  the  sone  level  loudness  of  each  analyzed  speech  frame  over  the 
complete  length  of  the  word  or  sentence.  It  is  a concatenation  of  the  individual  ISO-532B 
loudness  values  and  describes  the  change  in  loudness  over  the  duration  of  the  utterance. 
The  analytic  value  of  loudness  however  is  not  estimated  over  a time  average  of  the  loudness 
function  and  such  a measure  is  not  available  [149].  It  is  difficult  to  predict  how  human 
listeners  will  evaluate  the  loudness  of  non-stationary  sounds  which  convey  information  such 
as  speech.  In  chapter  3 we  noted  that  the  overall  loudness  of  a sentence  is  closer  to  the 
average  peak  loudness  than  the  average  loudness.  We  also  noted  that  sounds  with  strong 
temporal  fluctuations  contribute  to  an  elevated  perception  of  loudness,  and  the  temporal 
ordering  of  complex  sounds  has  an  effect  on  perceived  loudness.  Since  the  algorithm  targets 
the  voiced  regions  of  speech  the  loudness  of  these  regions  will  be  greater  in  proportion  to 
the non-voiced regions. Speech attributes such as clarity, presence, coloration, and fidelity
inadvertently contribute to the listeners' perception of loudness. The consistency and balance
of these attributes are inherently adjusted in a speech processing technique which is
required  to  not  increase  speech  energy.  In  chapter  6 we  conduct  listening  tests  to  evaluate 
the  subjective  quality  and  intelligibility  of  the  processed  speech  since  these  percepts  are 
intimately  related  to  loudness.  The  ISO-532B  analysis  is  used  here  to  evaluate  the  overall 
change  in  loudness  for  our  candidate  speech  processing  algorithms. 

In the following sections the ISO-532B is used to validate the warping factor and to evaluate
the approximation of a loudness gain function. We have stated in the previous chapters
that the bandwidth expansion technique in the warped filter design should increase loudness.
We have assumed this to be true from models of hearing, models of speech production,
auditory physiology, and from experimental studies on auditory perception. Our
research of auditory filters, masking effects, excitation functions, and critical bands has
revealed that loudness will increase as energy exceeds a critical band. The filter design of
Chapter 4 was based on this motivation, and the warped filter was designed for expansion
on a critical band scale. Up to this point, however, we have not analytically verified these
observations, or shown that such a filter accomplishes this task. Also, prior to selecting
the warped filter design as our loudness enhancement technique, other methods were
investigated in an attempt to increase loudness. Two of these methods were named 'energy
redistribution' and 'temporal modification'. Both these methods attempted to increase
loudness without increasing overall signal energy. The ISO-532B analysis was investigated
as an evaluation tool to determine which algorithms were capable, and good candidates, for
increasing loudness. The formant expansion method has been the most rewarding, and is
the topic of this thesis. ISO-532B analysis results for all three methods are presented simply
for demonstration, and to show the benefit of the time loudness distribution patterns.

Figure 5.1: Energy redistribution a) time plot and b) corresponding ISO-532B loudness
pattern.

To better understand the results of the ISO-532B analysis, we review the definition of
the sone and phon loudness levels, the equal loudness contours, and the power law of hearing.
Recall that the reference loudness of a 40-dB 1 kHz tone corresponds to a loudness of 1 sone. A
10-dB increase of this reference approximately corresponds to a doubling of loudness. This
is in response to the power law of hearing, which occurs for levels above 30 dB. Below this
level, as was discussed in chapter 2, the loudness rapidly decreases, and does not follow the
power law of hearing. For a 1 kHz tone, 10 dB corresponds to 10-phon, and an increment
of 10-phon corresponds to a doubling of loudness. The equal loudness contours are phon
contours at equal loudness and any 10-phon increase along the contour implies a doubling
of loudness, or equivalently doubling the sone level. Thus an increment of 20-phon from
the 40 dB reference corresponds to a loudness of 4 sone, an increment of 30-phon from the
reference corresponds to a loudness of 8 sone, and so on, up to about 128 sone. This
suggests there are around 7 doublings ($2^7 = 128$) of subjective loudness from the 40-dB
reference. This also provides us with a way to evaluate the relative change in loudness of
two ISO-532B loudness patterns. The effective loudness increase is the ratio of the sone
levels.
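The phon to sone relation reviewed above can be written compactly. The sketch below assumes
the doubling rule holds exactly (1 sone at 40 phon, doubling every 10 phon), which the text
states only for levels above roughly 30 to 40 dB; the function names are illustrative.

```python
import math

def phon_to_sone(phon):
    """Loudness in sone, assuming 1 sone at 40 phon and a doubling per 10 phon."""
    return 2.0 ** ((phon - 40.0) / 10.0)

def sone_to_phon(sone):
    """Inverse mapping under the same assumption."""
    return 40.0 + 10.0 * math.log2(sone)

def loudness_gain(sone_processed, sone_original):
    """Effective loudness increase as the ratio of two sone levels."""
    return sone_processed / sone_original

for phon in (40, 50, 60, 70, 110):
    print(phon, phon_to_sone(phon))
```

The same ratio can be applied frame by frame to two ISO-532B loudness patterns to obtain a
loudness gain curve over the utterance.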

Figure 5.1 shows the time domain plot of a TIMIT test sentence and the corresponding loudness
distribution pattern for the energy redistribution method. The energy redistribution
method removes energy from the voiced regions and applies it to the unvoiced
regions in real time. We can partially see these effects in the speech pattern in Figure
5.1a. The loudness plot of Figure 5.1b clearly shows the respective increases and decreases
in the sone levels in voiced and unvoiced regions from the energy redistribution. Since
the  effective  loudness  increase  corresponds  to  the  sone  ratio,  we  clearly  see  a pronounced 
loudness  gain  in  the  former  lower  energy  regions,  in  comparison  to  the  small  decrease  in 
the  high  energy  regions.  If  we  consider  the  principle  of  loudness  summation,  and  compare 
the  loudness  areas,  we  would  find  that  total  sone  level  is  approximately  unchanged.  The 
energy  redistribution  technique  is  an  appealing  technique  for  improving  intelligibility,  but  it 
is  difficult  to  evaluate  the  increase  in  loudness  from  the  ISO-532B  analysis.  The  ISO-532B 
does  not  consider  temporal  integration  effects  and  only  listening  tests  could  validate  the 
improvement  in  effective  loudness. 

The energy redistribution method is similar in nature to methods of speech compression. Speech compression algorithms
effectively  reduce  the  dynamic  amplitude  range  of  speech,  and  favorably  improve  vocal 
presence  and  loudness  with  good  success.  However,  they  are  not  equal  energy  based,  and 
considerably  increase  the  overall  signal  energy.  Normalization  of  the  energy  is  not  an 
option  since  this  effectively  reduces  the  loudness  and  dynamic  range  of  the  voiced  speech 
regions.  Speech  compression  methods  operate  either  on  the  time  signal  envelope  or  on  the 
modulation  envelopes  of  the  frequency  spectrum  [90,  119].  In  the  time  domain,  regions  of 
high  amplitude  speech  are  compressed,  and  regions  of  low  amplitude  speech  are  amplified. 
A gain  table  is  usually  employed  with  a few  first  order  filters  to  set  the  attack  time,  rise 
time,  and  decay  rate.  Multiband  speech  compressors  are  similar  in  concept,  but  operate 
on  the  time  trajectories  of  the  spectral  envelopes.  They  are  filterbank  compressors  which 
operate  on  the  frequency  amplitude  time  trajectories,  instead  of  compressing  the  amplitude 
temporal  envelope. 
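As an illustration of the time domain approach described above, the following is a minimal
sketch of a single band compressor. It is not the method of [90, 119]: the static gain curve
(threshold and ratio), the attack and release time constants, and all parameter values are
assumptions chosen for the example. A one-pole envelope follower plays the role of the first
order smoothing filters, and the gain computation stands in for the gain table.

```python
import numpy as np

def compress(x, fs, threshold_db=-20.0, ratio=4.0,
             attack_ms=5.0, release_ms=50.0, makeup_db=0.0):
    """Feed-forward compressor with a one-pole attack/release envelope follower."""
    x = np.asarray(x, dtype=float)
    a_att = np.exp(-1.0 / (fs * attack_ms * 1e-3))    # fast smoothing when level rises
    a_rel = np.exp(-1.0 / (fs * release_ms * 1e-3))   # slow smoothing when level falls
    env = 0.0
    y = np.empty_like(x)
    for n, s in enumerate(x):
        level = abs(s)
        coeff = a_att if level > env else a_rel
        env = coeff * env + (1.0 - coeff) * level
        level_db = 20.0 * np.log10(max(env, 1e-9))
        over_db = max(level_db - threshold_db, 0.0)    # amount above threshold
        gain_db = makeup_db - over_db * (1.0 - 1.0 / ratio)
        y[n] = s * 10.0 ** (gain_db / 20.0)
    return y

def rms(v):
    return float(np.sqrt(np.mean(np.square(v))))

# Hypothetical test signal: a quiet segment followed by a loud segment.
fs = 8000
t = np.arange(4000) / fs
tone = np.sin(2.0 * np.pi * 200.0 * t)
x = np.concatenate([0.05 * tone, 0.8 * tone])
y = compress(x, fs)

print(rms(x[4000:]) / rms(x[:4000]), rms(y[4000:]) / rms(y[:4000]))
```

The quiet segment stays below the threshold and passes unchanged, while the loud segment is
attenuated, so the level difference between the two shrinks. A positive `makeup_db` would then
raise the whole signal, which is where the increase in overall energy noted above comes from.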

Figure 5.2: SOLA temporal modification a) time plot and b) corresponding ISO-532B
loudness pattern.

Figure 5.2 shows the time domain plot of a TIMIT test sentence and the corresponding loudness
distribution pattern for the temporal modification method. The objective
of this method was to temporally expand and compress certain regions of speech. Vowels
could be shortened without sacrificing loudness in accordance with the temporal integration
theory of hearing, and the energy gain could be applied to temporally expand or amplify
the transient regions. These unvoiced regions usually correspond to consonant locations
associated with phonetic cues that emphasize sound differences. The Synchronized
Overlap and Add (SOLA) algorithm was used for time modification. In Figure 5.2a we see
that some of the high energy vowel regions have been temporally compressed and the low
energy regions expanded. This also changes the balance of energy, in that the energy
savings are distributed to the lower energy regions. In terms of temporal modification and
resulting  quality,  the  algorithm  performance  was  impeccable.  We  coded  the  algorithm  in 
assembly,  and  successfully  embedded  it  in  the  Motorola  12000  cellular  phone  for  product 
demonstration.  It  is  used  in  VoiceNotes®  recording  to  speed  up  or  slow  down  the  playback  of 
voice  messages.  However,  for  use  as  a loudness  enhancement  technique,  results  with  the  ISO- 
532B  were  difficult  to  substantiate.  On  examination  of  the  loudness  distribution  pattern 
it  is  difficult  to  judge  whether  an  increase  in  loudness  is  evident,  in  addition  to  changing 
the  speech  prosody.  The  time  patterns  are  no  longer  aligned,  and  the  ISO-532B  does  not 
consider  effects  due  to  temporal  integration.  In  addition,  the  real  time  implementation  of 
the  temporal  modification  routine  requires  a precise  speech  segmentation  algorithm,  large 
memory  buffers,  and  is  extraordinarily  complex. 
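For reference, the core of a synchronized overlap and add procedure can be sketched in a few
lines. This is a simplified variant written for illustration, not the embedded implementation
described above: it applies one uniform time scale factor, searches a small range of input
offsets for the best normalized cross-correlation with the output tail, and cross-fades over the
overlap. The frame, overlap, and search lengths are assumed values, and no speech segmentation
is performed.

```python
import numpy as np

def sola(x, rate, frame=512, overlap=128, search=64):
    """Time-scale x by `rate` (rate > 1 lengthens, rate < 1 shortens)."""
    x = np.asarray(x, dtype=float)
    hop_out = frame - overlap                 # synthesis hop
    hop_in = int(round(hop_out / rate))       # analysis hop
    fade_in = np.linspace(0.0, 1.0, overlap)
    out = x[:frame].copy()
    pos = hop_in
    while pos + frame + search <= len(x):
        tail = out[-overlap:]
        # Find the input offset whose leading samples best match the output tail.
        best_k, best_c = 0, -np.inf
        for k in range(search):
            seg = x[pos + k: pos + k + overlap]
            denom = np.sqrt(np.dot(seg, seg) * np.dot(tail, tail)) + 1e-12
            c = np.dot(seg, tail) / denom
            if c > best_c:
                best_k, best_c = k, c
        seg = x[pos + best_k: pos + best_k + frame]
        # Cross-fade the overlap, then append the remainder of the frame.
        out[-overlap:] = tail * (1.0 - fade_in) + seg[:overlap] * fade_in
        out = np.concatenate([out, seg[overlap:]])
        pos += hop_in
    return out

# Hypothetical test: stretch a 200 Hz tone sampled at 8 kHz by a factor of 1.5.
fs = 8000
x = np.sin(2.0 * np.pi * 200.0 * np.arange(fs) / fs)
y = sola(x, rate=1.5)

spectrum = np.abs(np.fft.rfft(y))
peak_hz = np.argmax(spectrum) * fs / len(y)
print(len(y) / len(x), peak_hz)
```

The output is roughly 1.5 times longer while the tone stays near 200 Hz, since the correlation
search keeps successive frames in phase. Applying different rates to vowel and transient regions
would additionally require the segmentation step mentioned above.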

Figure 5.3: Formant expansion a) time plot and b) corresponding ISO-532B loudness pattern.

Figure 5.3 shows the ISO-532B loudness analysis for a sentence modified by the formant
expansion method in comparison to the original sentence. This is the method used in the
design of the warped filter. The ISO-532B loudness curve clearly demonstrates a boost of the
modified sentence (solid) in comparison to the original signal (dotted). For this reason the
formant bandwidth expansion technique was selected as the speech processing technique for
increasing loudness. The numeric ratio of these sone curves describes the relative increase in
loudness; for example, a sone ratio of 12/8 implies a loudness gain of 1.5. Average loudness
ratios will be provided for all
phoneme  distributions  of  the  TIMIT  test  dataset  in  a following  section.  It  should,  however, 
be  noted  that  the  loudness  gain  is  dependent  on  the  speech  signal  characteristics.  The 
bandwidth  expansion  produces  a nonuniform  loudness  gain  across  the  sentence.  The  vowel 
regions  are  amplified  in  loudness  greater  than  the  unvoiced  regions  since  the  voiced  regions 
are  more  resonant.  Figure  5.3  was  presented  as  an  example  of  loudness  analysis  with  the 
ISO-532B  for  linear  bandwidth  expansion.  In  the  next  section  we  examine  the  contribution 
of  loudness  due  to  bandwidth  expansion  on  a nonlinear  frequency  scale. 




5.1.1  The  Optimal  Warping  Factor 

In the previous chapter we discussed the role of all-pass elements in warped linear prediction
and warped filter design. We noted that the all-pass elements induce time dispersion
and allow for non-linear frequency warping. Eq(4.25) revealed the all-pass warping factor,
$\alpha$, necessary to approximate the critical band scale. The warping factor is dependent on
sampling frequency, and for sampling frequencies of 10 kHz and 16 kHz the warping factor
is known to be $\alpha = 0.47$ and $\alpha = 0.56$, respectively. The premise of the warped filter design
has been to broaden the bandwidth of speech formants on a critical band scale. Loudness is known
to increase when a critical band is exceeded, and the principle of the filter design is to increase
bandwidths on a critical band scale. In practice, as previously mentioned, there
is a limit to which bandwidths can be expanded before intelligibility is sacrificed.
By virtue of the auditory system, we would expect the critical band warping factor to give
us the greatest increase in loudness.
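The effect of the warping factor on the frequency axis can be visualized with the phase response
of a first-order all-pass section. The sketch below assumes the common form
$D(z) = (z^{-1} - \alpha)/(1 - \alpha z^{-1})$, whose negative phase gives the warped frequency
$\tilde{\omega} = \omega + 2\arctan\left(\alpha\sin\omega / (1 - \alpha\cos\omega)\right)$; it is
not a reproduction of Eq(4.21) or Eq(4.25), and the function name is illustrative.

```python
import numpy as np

def warp_frequency(omega, alpha):
    """Warped frequency (rad/sample) produced by a first-order all-pass with coefficient alpha."""
    omega = np.asarray(omega, dtype=float)
    return omega + 2.0 * np.arctan2(alpha * np.sin(omega), 1.0 - alpha * np.cos(omega))

omega = np.linspace(0.0, np.pi, 9)
for alpha in (0.0, 0.47, 0.56):
    print(alpha, np.round(warp_frequency(omega, alpha) / np.pi, 3))
```

With $\alpha = 0$ the axis is unchanged. For positive $\alpha$ the mapping remains monotonic with
fixed end points at 0 and $\pi$, but low frequencies occupy a larger share of the warped axis,
which is the behavior needed to approximate the critical band scale.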

The warping factor α of the all-pass element in Eq(4.21) is a free parameter that sets the
warped frequency scale in the warped filter of Eq(4.38). We can evaluate the effect of the
warped filter on speech loudness for all values of α between -1 and +1. We have available
the ISO-532B analysis and the loudness approximation of Chapter 2. For the evaluation it
is necessary to use only one of the loudness analysis functions, and we use the ISO-532B.
The value of α that gives the greatest loudness on average for the vowel phoneme category
of speech will be considered the optimal value. The following analysis was conducted to
evaluate the change in loudness given α. The warped bandwidth filter of Eq(4.38) was used
to process all 1,681 test sentences of the TIMIT database. Each sentence was processed as
follows: frame sizes of 32 ms, 10th-order WLPC analysis, warped filtering with the complete
filter of Eq(4.38), 50% overlap-and-add with a Hanning window, β = 0.4, and γ adjusted
between 0.4 < γ < 0.85 as a function of tonality. Overlap-and-add was necessary to
alleviate the discontinuities at frame boundaries caused by the new filter coefficients
generated for each frame. Sub-frame interpolation was not employed in this study. In
addition, filter states were preserved between frames to reduce boundary artifacts, though
overlap-and-add essentially accomplished the same feat. The spectral flatness measure (SFM)
of Eq(3.3) was used to determine the tonality, and a linear ramp function was used to set
the evaluation radius based on this value. We only want to broaden the bandwidths of vowel
regions of speech because of their high energy content and smooth spectral envelope. An SFM
of 1 indicates complete tonality (such as a sine wave) and an SFM of 0 indicates
non-tonality (such as white noise). For a tonal signal such as a vowel, we want the maximum
bandwidth expansion, so γ = 0.85. For non-tonal speech, we want a minimal contribution of
the warped filter, so we set γ = 0.4. The SFM values between 0.6 and 1 were linearly mapped
to 0.4 < γ < 0.8. The 0.6 clip was set primarily to ensure that tonal components were
considered for formant expansion.
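
The linear ramp from tonality to evaluation radius can be sketched as follows. This is an illustrative reading of the description above rather than the author's code: the function name and argument names are hypothetical, and the upper radius defaults to 0.85 even though the text also quotes 0.8 as the top of the mapped range.

```python
def evaluation_radius(sfm, sfm_clip=0.6, gamma_min=0.4, gamma_max=0.85):
    """Map a tonality value in [0, 1] to the evaluation radius gamma.

    Values at or below sfm_clip get the minimum radius (minimal filtering);
    values between sfm_clip and 1 are mapped linearly up to gamma_max.
    The tonality convention follows the text: 1 is fully tonal.
    """
    if not 0.0 <= sfm <= 1.0:
        raise ValueError("sfm must lie in [0, 1]")
    if sfm <= sfm_clip:
        return gamma_min
    ramp = (sfm - sfm_clip) / (1.0 - sfm_clip)
    return gamma_min + ramp * (gamma_max - gamma_min)


if __name__ == "__main__":
    for value in (0.2, 0.6, 0.8, 1.0):
        print(value, evaluation_radius(value))
```

Because β is fixed at 0.4, a frame that receives γ = 0.4 passes through a filter whose numerator and denominator radii coincide, so non-tonal frames are left essentially unchanged.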

Table 5.1: Dialect regions and number of speakers in each region.

Region    #     Region    #
dr1      11     dr5      28
dr2      26     dr6      11
dr3      26     dr7      23
dr4      32     dr8      11

Each sentence in the TIMIT test database was processed as just described and evaluated
for total loudness using the ISO-532B. Figure 5.5 shows the change in loudness given
α for the vowel phonemes of each dialect region in the test set. Each speaker utters 10
sentences, and the total number of speakers in each dialect region is given in Table 5.1.
Thus each subplot of Figure 5.5 contains a number of curves equal to ten times the number
of speakers in that region. Each plot in Figure 5.5 describes the change in loudness
for each sentence for values of α equal to 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, and 0.8. The
change is relative to the loudness at α = 0. For visual convenience, Figure 5.4 presents the
mean response for each dialect region. The brackets are the standard deviation for each
value of α. It can be seen from this figure that values of α between 0.4 and 0.6 provide
the greatest positive change in loudness. These values are reasonably close to the α = 0.56
value for critical band approximation at fs = 16 kHz. Thus, the results favorably show that
critical band expansion provides a greater increase in loudness.
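
The per-region summary behind a plot such as Figure 5.4 amounts to a mean and a spread for each value of α, followed by picking the α with the largest mean. The sketch below assumes a hypothetical input layout (a dictionary from α to a list of per-sentence loudness changes) and uses the population standard deviation; the text does not say which estimator was used. The numbers in the usage section are placeholders, not measured data.

```python
from statistics import mean, pstdev


def summarize_sweep(changes_by_alpha):
    """Return (best_alpha, summary) for a sweep of the warping factor.

    changes_by_alpha maps each alpha to the per-sentence loudness changes
    relative to alpha = 0. summary maps alpha to (mean, standard deviation).
    """
    summary = {alpha: (mean(values), pstdev(values))
               for alpha, values in changes_by_alpha.items()}
    best_alpha = max(summary, key=lambda alpha: summary[alpha][0])
    return best_alpha, summary


if __name__ == "__main__":
    # Placeholder values for illustration only.
    toy = {0.3: [0.02, 0.04], 0.5: [0.10, 0.12], 0.7: [0.05, 0.03]}
    best, table = summarize_sweep(toy)
    print(best, table[best])
```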




Figure 5.4: A sweep of α to determine the frequency scale for the optimal gain in loudness.
Shows mean α curves for dialect regions 1 to 8, and the corresponding variance delimited
by bars.

5.1.2  Warped  Filter  Loudness 

Table 5.3 presents the loudness increase for all phoneme regions of the entire TIMIT
test dataset. Results are reported as the overall average for each phoneme region. The
loudness increase is presented as the ISO-532B sone level ratio of the processed speech to
the unmodified original speech. Each sentence was processed in the same manner as in the
previous section, and the ISO-532B analysis was conducted over speech frame lengths of
32 ms. Seven phoneme categories are shown: stops, affricates, fricatives, nasals, vowels,
glides, and others. The unvoiced regions are the stops, affricates, and fricatives. The
voiced regions are usually considered to be the vowels, nasals, and glides. The others
category was defined to include non-speech activity such as pauses, breaths, and noise
bursts. In each category there are 5 columns: the phoneme type, the total number of
occurrences of the phoneme N, the total average power P of the occurring phoneme in the
distribution, the loudness increase for linear expansion Ny/Nx, and the loudness increase
for warped expansion Ñy/Nx, where Ny represents the sone level of the processed speech, and
the tilde represents warped expansion with α = 0.5. The 'mean' row represents the mean over
all occurrences of that category type, not the mean of the column, since the latter would
be a mean of means. The 'total' is either the total number of occurrences for that category
or the total power as a percent of the total power for that category. Loudness ratios for
the linear expansion and warped expansion are adjacent for comparison purposes in Table 5.3.
Results show that the warped model provides a slightly higher loudness increase than the
linear expansion, and the difference is greatest for the voiced regions of speech, as noted
in Table 5.2. Intuitively, from our review of hearing physiology, we expect the loudness to
be greater on a critical band scale. However, we are unable to substantiate the loudness
gain for a particular frequency scale with regard to intelligibility. No measures are
available to evaluate the loss of intelligibility, and we thus cannot normalize the
contribution of the loudness gain with respect to intelligibility.

[Figure 5.5: eight panels, dR1 through dR8, each with the warping factor alpha on the
horizontal axis (ticks at 0.2, 0.4, 0.6, 0.8).]

Figure 5.5: A sweep of the warping factor α versus the average change in loudness for vowel
phonemes of each sentence in all 8 dialect regions of the TIMIT test dataset. Sentence
numbers are along the axis projecting into the page.

Table 5.2: Phoneme categories and the loudness gain difference between the warped and
linear bandwidth expansion filters.

Category    (Ñy - Ny)/Nx     Category      (Ñy - Ny)/Nx
vowels          0.14         stops             0.08
nasals          0.11         affricates        0.02
glides          0.12         fricatives        0.05
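
The entries of Table 5.2 can be checked against the category means reported in Table 5.3 by subtracting the linear ratio from the warped ratio. The short sketch below does that arithmetic; the dictionary simply restates the 'mean' rows of Table 5.3, and the variable names are hypothetical.

```python
# (linear Ny/Nx, warped Ny/Nx) category means taken from the 'mean' rows of Table 5.3.
MEAN_RATIOS = {
    "vowels": (1.24, 1.38),
    "nasals": (1.29, 1.40),
    "glides": (1.26, 1.38),
    "stops": (1.16, 1.24),
    "affricates": (1.07, 1.09),
    "fricatives": (1.13, 1.18),
}


def gain_difference(linear, warped):
    """Warped minus linear loudness ratio, rounded to the table's precision."""
    return round(warped - linear, 2)


DIFFERENCES = {name: gain_difference(*ratios) for name, ratios in MEAN_RATIOS.items()}

if __name__ == "__main__":
    for name, value in DIFFERENCES.items():
        print(name, value)
```

The voiced categories (vowels, nasals, glides) all come out above 0.10, while the unvoiced categories stay at or below 0.08, which is the pattern the text describes.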

These results state explicitly the average loudness increase of the sentences processed
by the warped filter with α = 0 and α = 0.5. Unfortunately, however, the loudness
gain only reflects how much louder one speech sound is in comparison to another. It does
not tell us the equivalent power gain provided by the speech enhancement technique. In
developing a loudness enhancement filter it would be useful to know the effective dB
gain introduced by the warped filter. Due to the nature of speech and of the filter model,
however, we should recognize that different phoneme categories as well as speaker
characteristics result in different loudness ratios. We can ascribe an average loudness
ratio to each category, but this still does not tell us the equivalent power savings. In
the next section, we investigate this question and propose a loudness gain function which
provides a close approximation to the actual gain savings, as validated by the ISO-532B
analysis.




Table 5.3: TIMIT TEST phoneme occurrences (N), power (P%), linear (α = 0) loudness
increase (number of times louder) Ny/Nx, warped (α = 0.5) loudness increase Ñy/Nx.

Stops       N        P  Ny/Nx  Ñy/Nx    Affricates      N        P  Ny/Nx  Ñy/Nx
b        1852     0.13   1.35   1.51    jh            542     0.45   1.08   1.10
d        1930     0.13   1.18   1.24    ch            527     1.30   1.07   1.09
g        1705     0.11   1.23   1.35    mean                         1.07   1.09
p        1933     0.15   1.19   1.27    total        1069   0.276*
t        2257     0.35   1.08   1.11
k        2130     0.29   1.12   1.17    Fricatives      N        P  Ny/Nx  Ñy/Nx
dx       1793     0.27   1.22   1.36    s            2928     2.88   1.09   1.18
q        2035     0.86   1.22   1.35    sh           1863     2.36   1.07   1.11
bcl      1816     0.03   1.18   1.27    z            2076     1.02   1.10   1.15
dcl      2284     0.03   1.11   1.18    zh           1642     0.78   1.07   1.08
gcl      1696     0.03   1.14   1.24    f            1906     0.12   1.10   1.12
pcl      1950     0.01   1.07   1.12    th           1659     0.08   1.10   1.12
tcl      2718     0.02   1.06   1.11    v            1789     0.19   1.24   1.33
kcl      2433     0.01   1.07   1.12    dh           1882     0.12   1.28   1.35
mean                     1.16   1.24    mean                         1.13   1.18
total   28532   1.245*                  total       15745    3.84*

Vowels      N        P  Ny/Nx  Ñy/Nx    Nasals          N        P  Ny/Nx  Ñy/Nx
iy       3049     4.72   1.20   1.35    m            2134     0.36   1.34   1.47
ih       2342     6.41   1.22   1.37    n            2809     0.37   1.26   1.37
eh       2147    12.58   1.20   1.35    ng           1673     0.28   1.26   1.37
ey       1853    10.51   1.18   1.33    em           1628     0.20   1.30   1.42
ae       2214    22.20   1.18   1.30    en           1641     0.20   1.32   1.44
aa       2023    24.42   1.28   1.42    eng          1627     0.20   1.38   1.49
aw       1690    19.81   1.25   1.37    nx           1674     0.37   1.17   1.27
ay       1831    18.78   1.24   1.36    mean                         1.29   1.40
ah       1938    12.38   1.25   1.38    total       13186   0.675*
ao       2050    16.79   1.35   1.48
oy       1689    13.79   1.30   1.44    Glides          N        P  Ny/Nx  Ñy/Nx
ow       1782    13.15   1.29   1.42    l            2706     5.17   1.32   1.45
uh       1712     7.14   1.29   1.43    r            2842     6.89   1.21   1.38
uw       1689     3.59   1.32   1.48    w            2114     2.75   1.34   1.50
ux       1745     3.02   1.25   1.42    y            1806     1.01   1.24   1.39
er       1871     8.51   1.17   1.37    hh           1716     0.15   1.17   1.23
ax       2209     2.40   1.28   1.40    hv           1683     0.79   1.21   1.29
ix       3285     2.28   1.22   1.36    el           1713     3.92   1.33   1.46
axr      2213     3.33   1.19   1.36    mean                         1.26   1.38
axh      1695     0.05   1.15   1.20    total       14580  12.034*
mean                     1.24   1.38
total   41027  81.926*



5.1.3  Equating  Energy  to  Loudness 


The loudness filter is given by the following relation, where A(z) represents the linear
prediction coefficients, A(z/β) represents the bandwidth expanded version, and S(z) is the
speech signal.

    \tilde{S}(z) = S(z) \, \frac{A(z)}{A(z/\beta)}                                (5.1)

The constraint is to keep the overall power level the same but make it sound louder:

    N_{\tilde{s}} > N_{s}


What we would like to know is how much power we must add to the original signal to
achieve the same loudness level, i.e., we want to scale up the original signal s(n) by a
gain factor g until it sounds as loud as the processed signal s̃(n). By Parseval's theorem
the gain term is transferable to the frequency domain. We allow the power to go up since we
are looking for equal loudness, hence the >.

    g^{2} \int_{-\pi}^{\pi} |S(\omega)|^{2} \, \frac{d\omega}{2\pi} > \int_{-\pi}^{\pi} |\tilde{S}(\omega)|^{2} \, \frac{d\omega}{2\pi}        (5.2)

Our objective is to find the gain which gives us equal loudness.

    N_{s_g} = N_{\tilde{s}}


If we utilize Moore's model of loudness we can set up a system of equations to solve for the
gain term that gives us equal loudness. Moore's model of loudness is given by

    N' = c \left[ \left( \frac{E}{E_0} \right)^{\kappa} - \left( \frac{E_{tq}}{E_0} \right)^{\kappa} \right]        (5.3)

For moderate to high levels of speech we can assume the excitation at threshold is
negligible [51]. E_0 is the reference excitation produced by a sound at 0 dB SPL, and c is a
constant. The specific loudness is given by

    N' = c E^{\kappa}                                                            (5.4)

The specific loudness of S̃(z) (the bandwidth expanded version) and S_g(z) (the scaled
original) can be respectively described by

    N'_{\tilde{s}} = c E_{\tilde{s}}^{\kappa}
    N'_{s_g} = c E_{s_g}^{\kappa}


What we desire is to determine at what gain we achieve equal loudness. This can be
expressed by setting the loudness ratio to unity

    1 = \frac{N'_{s_g}}{N'_{\tilde{s}}} = \frac{c E_{s_g}^{\kappa}}{c E_{\tilde{s}}^{\kappa}}        (5.5)


The constant c drops out of the equation for unity. The compressive nonlinearity
cannot be pulled out of the equation since the total excitation is a sum of this power series.


(5.6) 


If we assume no masking, the power spectrum gain is directly transferred to the excitation
E(z). The gain is squared since it is in the power domain; g will be the scale factor
applied to the speech time signal. Also, the overall loudness N is the sum of the specific
loudnesses N'.

    1 = \frac{\sum g^{2\kappa} E_{s}^{\kappa}}{\sum E_{\tilde{s}}^{\kappa}}        (5.7)


Since we have previously shown that 82% of the power in vowels is unmasked, we can include
the assumption that no masking occurs. If no masking occurs, there is no smearing along
the critical band spectrum; thus, the excitation function reduces to the warped critical
band power spectrum, |S_w(ω)|² = |S(θ(ω))|², where θ is the frequency warping function.
To model loudness it is necessary to include the transmission characteristics of the outer
to middle ear. We include a warped outer to middle ear filter.

    E(\omega) = |S_w(\omega)|^{2} \, |H_w(\omega)|^{2}                             (5.8)

Inserting for both excitations and solving for g we get

    g^{2\kappa} = \frac{\int_{-\pi}^{\pi} \left( |\tilde{S}_w(\omega)|^{2} |H_w(\omega)|^{2} \right)^{\kappa} \frac{d\omega}{2\pi}}{\int_{-\pi}^{\pi} \left( |S_w(\omega)|^{2} |H_w(\omega)|^{2} \right)^{\kappa} \frac{d\omega}{2\pi}}        (5.9)
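
Under the no-masking assumption, the equal-loudness condition is that the summed compressed excitation of the scaled original equals that of the processed signal, which can be solved for g in closed form. The sketch below illustrates that step only. The function name is hypothetical, the excitations are plain lists of band powers, and the compression exponent (written kappa here) must be supplied by the caller because its value is not stated in this section; 0.2 in the usage lines is an illustrative assumption.

```python
def equal_loudness_gain(exc_original, exc_processed, kappa):
    """Amplitude gain g that equalizes summed compressed excitation.

    Solves sum((g**2 * E_s)**kappa) == sum(E_p**kappa) for g, where E_s and
    E_p are the per-band excitations (powers) of the original and processed
    signals and kappa is the compressive exponent.
    """
    if len(exc_original) != len(exc_processed):
        raise ValueError("excitation patterns must have the same length")
    numerator = sum(e ** kappa for e in exc_processed)
    denominator = sum(e ** kappa for e in exc_original)
    return (numerator / denominator) ** (1.0 / (2.0 * kappa))


if __name__ == "__main__":
    peaky = [2.5, 0.5, 0.5, 0.5]   # total power 4, concentrated in one band
    flat = [1.0, 1.0, 1.0, 1.0]    # same total power, spread evenly
    print(equal_loudness_gain(peaky, flat, kappa=0.2))
```

The usage lines show why bandwidth expansion at equal power can raise loudness in this model: with an exponent below one, spreading the same power across more bands increases the compressed sum, so the gain needed to match it exceeds one.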

Eq(5.9) is essentially a ratio of the loudnesses, excluding the excitation due to masking.
We can see that the three main principles of psychophysics are included in this equation:
the critical band scale, the power law of compression, and the outer to middle ear filter.
This equation is suitable for analysis, and preliminary results indicate g is a good
approximation to the gain necessary for equal loudness. We can use the loudness
approximation developed in Chapter 2 to realize this equation, or the actual ISO-532B
method. Since we are assuming little masking in vowels, we have bypassed the complex
excitation function and masking slope calculations. The loudness approximation allows us to
proceed to an all-pole model. Now let us substitute the all-pole model in the denominator
for the original warped spectrum S_w(z), and the bandwidth expanded model from Eq(5.1) in
the numerator for the warped spectrum

    g^{2\kappa} = \frac{\int_{-\pi}^{\pi} \left( \left| S_w(\omega) \frac{A_w(e^{j\omega})}{A_w(e^{j\omega}/\beta)} \right|^{2} |H_w(\omega)|^{2} \right)^{\kappa} d\omega}{\int_{-\pi}^{\pi} \left( \left| \frac{\sigma_2}{A_w(e^{j\omega})} \right|^{2} |H_w(\omega)|^{2} \right)^{\kappa} d\omega}        (5.10)


We should notice that S_w(ω)A_w(e^{jω}) in the numerator is the (prediction error) residual
gain σ_1 of the warped all-pole model. This gain is independent of the integration variable
and can be removed from the integral. Essentially, we can consider first creating a warped
bandwidth expansion all-pole model, σ_1/A_w(e^{jω}/β), to be included in the equation. The
prediction error can be solved for in the same manner as the residual gain in the LPC
analysis.

    \sigma^{2} = \mathbf{a}_w^{T} \mathbf{R}_w \mathbf{a}_w                        (5.11)


where R_w is the warped autocorrelation matrix described in Eq(4.17), and a_w are the
warped prediction coefficients of Eq(4.19). Eq(5.11) represents the convolution of the
speech with the warped prediction coefficients. As a filter operation, the complete
convolution must be performed, i.e., the final tap states must be included in the error
signal. The residual is then the sum of the squares of the error signal. The residual gains
ensure that Parseval's energy theorem is valid for the time domain and frequency domain
energy. Let us bring the constants σ_1 and σ_2 out of the equation.

    g^{2\kappa} = \left( \frac{\sigma_1}{\sigma_2} \right)^{2\kappa} \frac{\int_{-\pi}^{\pi} \left| \frac{H_w(\omega)}{A_w(e^{j\omega}/\beta)} \right|^{2\kappa} d\omega}{\int_{-\pi}^{\pi} \left| \frac{H_w(\omega)}{A_w(e^{j\omega})} \right|^{2\kappa} d\omega}        (5.12)


It should be noted that the residual gain term σ_1 calculated from Eq(5.11) for the
bandwidth expanded model does not completely restore the original energy. The residual gain
term provides complete energy restoration only when the optimal prediction coefficients are
used [114]. In the real time system, the gains are determined from the unfiltered and
filtered speech frames. The processed signal is then elevated to the correct gain based on
the ratio. This is a standard technique for the short-term postfilter operation [15].

To numerically evaluate Eq(5.12), it is necessary to generate the warped outer to middle
ear transfer function H_w(z). One way to warp H(ω), described by Eq(2.20) in Chapter 2, is
to warp the impulse response h(n) by replacing unit delays with all-passes and then
calculating the new frequency response. A similar warping procedure can be applied to S(ω)
to generate the corresponding warped quantities, and the prediction coefficients are then
solved for in the normal way. The warped outer to middle ear filter used in Eq(5.13) is
shown in Figure 2.16 as the continuous dotted line. The actual ISO-532B and the
approximation used the discrete critical band attenuations, not a continuous
representation. This was because the masking slopes were calculated from a discrete
auditory excitation model. The auditory filters were validated for use as a continuous
excitation model, but were too complex for a practical approximation [109]. Fortunately, we
are evaluating the loudness gain of vowels in Eq(5.13), and the excitation slope
calculations are bypassed. Finally, the relation can be expressed in the z domain as


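    g^{2\kappa} = \frac{\sum_{k=0}^{N-1} \left| \frac{H_w(z)}{A_w(z/\beta)} \right|^{2\kappa}_{z = e^{j 2\pi k/N}}}{\sum_{k=0}^{N-1} \left| \frac{H_w(z)}{A_w(z)} \right|^{2\kappa}_{z = e^{j 2\pi k/N}}}        (5.13)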




Eq(5.13) describes the gain factor required to elevate a speech signal to equal loudness
with the bandwidth expanded version. We should note a few things from this equation. Since
we have not included masking effects, and the all-pole models have the same energy, the
only increase in loudness must come from the outer to middle ear filter and the compression
factor of the power spectrum along the critical band scale. Eq(5.13) also tells us that we
can estimate this gain by determining the WLPC coefficients and the power series expansion
of the WLPC coefficients. This gain depends on the characteristics of the speech signal.
Inherently, it also tells us the dB gain we can expect for a bandwidth expanded signal to
achieve equal loudness at the same power level as the original signal. This is the general
relation we have been seeking. Thus the value of g is the gain necessary to scale the
original signal to equal loudness and sets the dB increase as

    \mathrm{GAIN}_{dB} = 10 \log_{10}(g)                                          (5.14)
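
A discrete evaluation in the spirit of Eqs(5.13) and (5.14) can be sketched from a set of prediction coefficients. This is only an illustration under several labeled assumptions: an ordinary (unwarped) polynomial A(z) stands in for the warped A_w(z), the ear filter defaults to a flat response, the two all-pole spectra are normalized to equal energy, the exponent kappa is supplied by the caller, and all function names are hypothetical.

```python
import cmath
import math


def allpole_power(a, beta, n_bins):
    """|1 / A(z/beta)|^2 sampled at z = exp(j*2*pi*k/n_bins), k = 0..n_bins-1.

    a holds the polynomial coefficients [1, a1, a2, ...] of A(z); evaluating
    at z/beta scales the i-th coefficient by beta**i.
    """
    spectrum = []
    for k in range(n_bins):
        z_inv = cmath.exp(-2j * math.pi * k / n_bins)
        a_val = sum(c * (beta * z_inv) ** i for i, c in enumerate(a))
        spectrum.append(1.0 / abs(a_val) ** 2)
    return spectrum


def equal_loudness_gain_db(a, beta, kappa, ear_power=None, n_bins=512):
    """dB gain, written as 10*log10(g) to follow Eq(5.14)."""
    original = allpole_power(a, 1.0, n_bins)
    expanded = allpole_power(a, beta, n_bins)
    # Assumption: both models carry the same energy before comparison.
    scale = sum(original) / sum(expanded)
    expanded = [p * scale for p in expanded]
    if ear_power is None:
        ear_power = [1.0] * n_bins  # assumption: flat outer to middle ear filter
    num = sum((p * h) ** kappa for p, h in zip(expanded, ear_power))
    den = sum((p * h) ** kappa for p, h in zip(original, ear_power))
    g = (num / den) ** (1.0 / (2.0 * kappa))
    return 10.0 * math.log10(g)


if __name__ == "__main__":
    # A single resonant pole as a stand-in for a formant; kappa is illustrative.
    print(equal_loudness_gain_db([1.0, -0.9], beta=0.5, kappa=0.2))
```

With beta = 1 the two spectra coincide and the gain is 0 dB; a smaller beta flattens the spectrum at equal energy, so the compressed sum grows and the gain becomes positive.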


5.1.4  Results 

Tables 5.4 through 5.7 present the loudness gain described by Eqs(5.13) & (5.14) for all
phoneme regions of the TIMIT test dataset. Each sentence was processed on a frame by
frame basis using the analysis described in section 5.1.1. Each table provides the loudness
increase, Ny/Nx, the associated decibel increase described by Eq(5.14), dBgain, and the
approximation error of this increase, E. The decibel gain is the gain that can be applied
to the original speech signal to make it sound as loud as the processed (bandwidth
expanded) speech signal, where both signals are at equal energy. The approximation error is
the sone level percent difference between the bandwidth expanded speech and the gain scaled
original speech. It is described by E = |1 - Ny/Ngx|, where Ngx represents the sone level
of the gain scaled original signal. When the gain scaled original speech is as loud as
the bandwidth expanded speech, the ratio Ny/Ngx = 1 and E = 0. The true ISO-532B is
used to calculate the approximation error E for all error results.
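
The error measure is simple enough to state in a few lines. The sketch below assumes E is reported in percent, as the results tables suggest, and the function names are hypothetical. It also converts a percent error into sones at a given reference loudness, for example 0.6% at 16 sones.

```python
def approximation_error_percent(n_y, n_gx):
    """E = |1 - Ny/Ngx| expressed in percent.

    n_y is the sone level of the bandwidth expanded speech and n_gx is the
    sone level of the gain scaled original speech.
    """
    return abs(1.0 - n_y / n_gx) * 100.0


def error_in_sones(error_percent, reference_sones):
    """Size of a percent error at a given reference loudness, in sones."""
    return error_percent / 100.0 * reference_sones


if __name__ == "__main__":
    print(approximation_error_percent(16.0, 16.0))
    print(error_in_sones(0.6, 16.0))
```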

Tables 5.4 & 5.5 present the results for the linear bandwidth expansion model. The
second two tables, Tables 5.6 & 5.7, present the results for the warped bandwidth expansion
model. The first table of each pair uses the true ISO-532B analysis to determine the
equivalent loudness scaling gain. Both the numerator and denominator of Eq(5.13) are
replaced with the true ISO-532B loudness model. This allows a formal measure of the gain
approximation, and serves as a reference for the loudness approximation. It also allows us
to scale the original signal by the resulting gain and evaluate the sone error. The second
table of each pair uses the loudness approximation described in section 2.4.3 to determine
the equivalent loudness scaling gain. The loudness approximation was presented as a
substitute for the complex ISO-532B analysis for real time processing. The numerator and
denominator of Eq(5.13) each represent this loudness approximation.

Linear 

Table 5.4 presents the linear bandwidth expansion loudness increase defined by Eq(5.13)
and the decibel gain using Eq(5.14) for the true ISO-532B analysis. An associated error
determined by the true ISO-532B is presented to provide the sone level error difference
between the processed signal and the gain scaled original. We anticipated the loudness gain
would be greatest for the voiced regions of speech. These correspond to the vowels, glides,
and nasals. The mean dB loudness gain for vowels under linear bandwidth expansion is
determined to be 1.4 dB with an approximation error of 0.6%. So, at a reference loudness of
Ny = 16 sones, a 0.6% error corresponds to an error of roughly 0.1 sones in Ngx. For this
value in Table 5.4, the approximation states that we can scale up the original signal by
1.4 dB with a sone percent error difference of |1 - Ny/Ngx| = 0.6% between the bandwidth
expanded signal and the gain scaled original signal. It should be noted that we can
calculate the exact scale value to achieve zero error for every speech utterance. Since we
are using the true ISO-532B, we can iteratively scale down the processed speech and evaluate
the loudness level until the sone level is equal to that of the unprocessed speech. However,
this is impractical, and the approximation, which we have previously shown to be good, is
sufficient for evaluating the error. Interestingly, the glides and nasals have a slightly
higher dB gain (about 0.1 and 0.2 dB, respectively) than the vowel regions. In contrast, the
unvoiced regions have a lower effective dB gain. The stops, affricates, and fricatives have
respective dB gains of 0.9, 0.4, and 0.8 dB. Results are summarized in Table 5.8.
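
The iterative scaling mentioned above (adjust a scale factor, re-evaluate loudness, repeat until the sone levels match) can be sketched as a bisection search. This is an assumption about how such a search might be organized, not the procedure used in the study. The loudness model is passed in as a callable, assumed to increase monotonically with the scale factor, and the power-law function in the usage lines is a toy stand-in for an ISO-532B analysis.

```python
def match_loudness(loudness_of_scale, target_sones, lo=0.0, hi=1.0,
                   tol=1e-6, max_iter=200):
    """Find the scale factor whose loudness equals target_sones by bisection.

    loudness_of_scale(s) returns the loudness in sones of the signal scaled
    by s and must be non-decreasing in s on [lo, hi].
    """
    if not loudness_of_scale(lo) <= target_sones <= loudness_of_scale(hi):
        raise ValueError("target loudness is not bracketed by [lo, hi]")
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if loudness_of_scale(mid) < target_sones:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    # Toy loudness model: 20 sones at full scale with a compressive exponent.
    def toy_loudness(scale):
        return 20.0 * scale ** 0.6

    print(match_loudness(toy_loudness, target_sones=16.0))
```

Each iteration costs one full loudness evaluation per utterance, which is why the text treats the exact search as impractical compared with the closed-form approximation.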

We  hoped  the  unvoiced  regions  would  show  very  little  dB  gain  improvement.  The 
filter  is  designed  to  bandwidth  expand  formant  regions  of  speech.  However,  this  requires  a 




Table 5.4: Equal energy phoneme gains for linear expansion α = 0 with the true ISO-532B
for the 1,681 sentences of the TIMIT test set. The ratio Ny/Nx is the loudness increase
of the enhanced phoneme to the original (number of times louder), dBgain is the gain from
Eq(5.13) required to scale up the original to achieve equal loudness, Ny/Ngx is the loudness
increase of the enhanced to the scaled original and E = |1 - Ny/Ngx| is the approximation
error of dBgain.

Stops    Ny/Nx  dBgain      E    Affricates   Ny/Nx  dBgain      E
b         1.35     1.9    4.4    jh            1.08     0.5    1.3
d         1.18     1.1    2.4    ch            1.07     0.4    0.5
g         1.23     1.3    4.1    mean          1.07     0.4    0.9
p         1.19     1.1    3.0
t         1.08     0.5    1.1
k         1.12     0.7    1.7    Fricatives   Ny/Nx  dBgain      E
dx        1.22     1.3    2.3    s             1.09     0.6    1.3
q         1.22     1.3    2.7    sh            1.07     0.4    0.6
bcl       1.18     1.0    5.7    z             1.10     0.6    2.5
dcl       1.11     0.7    6.5    zh            1.07     0.4    1.4
gcl       1.14     0.8    6.9    f             1.10     0.6    2.9
pcl       1.07     0.4    5.1    th            1.10     0.6    5.1
tcl       1.06     0.4    5.5    v             1.24     1.4    4.9
kcl       1.07     0.4    5.1    dh            1.28     1.6    4.8
mean      1.16     0.9    4.0    mean          1.13     0.8    2.9

Vowels   Ny/Nx  dBgain      E    Nasals       Ny/Nx  dBgain      E
iy        1.20     1.2    1.5    m             1.34     1.9    5.6
ih        1.22     1.2    1.0    n             1.26     1.5    4.0
eh        1.20     1.2    0.1    ng            1.26     1.5    3.7
ey        1.18     1.1    0.2    em            1.30     1.7    5.9
ae        1.18     1.1    0.4    en            1.32     1.8    6.5
aa        1.28     1.6    1.2    eng           1.38     2.0    8.0
aw        1.25     1.4    0.9    nx            1.17     1.0    1.6
ay        1.24     1.4    0.3    mean          1.29     1.6    5.0
ah        1.25     1.4    0.4
ao        1.35     1.9    0.8
oy        1.30     1.7    0.4    Glides       Ny/Nx  dBgain      E
ow        1.29     1.6    0.4    l             1.32     1.8    1.1
uh        1.29     1.6    0.7    r             1.21     1.2    0.8
uw        1.32     1.8    1.3    w             1.34     1.9    1.8
ux        1.25     1.4    1.6    y             1.24     1.4    3.0
er        1.17     1.0    0.5    hh            1.17     1.0    2.6
ax        1.28     1.6    2.4    hv            1.21     1.2    2.3
ix        1.22     1.3    2.7    el            1.33     1.8    1.5
axr       1.19     1.1    1.7    mean          1.26     1.5    1.9
axh       1.15     0.9    4.3
mean      1.24     1.4    0.6



Table 5.5: Equal energy phoneme gains for linear expansion α = 0 with the warped
approximation of the ISO-532B for the 1,681 sentences of the TIMIT test set with SPL levels
between 50 and 80 dB. The ratio Ny/Nx is the loudness increase of the enhanced phoneme
to the original (number of times louder), dBgain is the gain from Eq(5.13) required to scale
up the original to achieve equal loudness, Ny/Ngx is the loudness increase of the enhanced
to the scaled original and E = |1 - Ny/Ngx| is the approximation error of dBgain.

Stops    Ny/Nx  dBgain      E    Affricates   Ny/Nx  dBgain      E
b         1.39     2.1    5.2    jh            1.09     0.5    2.7
d         1.17     1.0    1.8    ch            1.07     0.5    1.3
g         1.27     1.5    7.0    mean          1.08     0.5    2.0
p         1.21     1.2    4.6
t         1.09     0.6    1.9
k         1.14     0.8    3.3    Fricatives   Ny/Nx  dBgain      E
dx        1.27     1.5    4.8    s             1.11     0.7    3.5
q         1.30     1.7    7.4    sh            1.08     0.5    1.2
bcl       1.52     2.7   19.4    z             1.13     0.8    5.6
dcl       1.34     1.9   16.5    zh            1.08     0.5    2.8
gcl       1.39     2.1   18.8    f             1.11     0.7    4.0
pcl       1.19     1.1   11.1    th            1.15     0.9    8.4
tcl       1.21     1.2   12.9    v             1.45     2.4   14.9
kcl       1.17     1.0    9.5    dh            1.37     2.0    9.0
mean      1.26     1.5    9.0    mean          1.19     1.1    6.2

Vowels   Ny/Nx  dBgain      E    Nasals       Ny/Nx  dBgain      E
iy        1.24     1.4    4.2    m             1.59     3.0   17.9
ih        1.24     1.4    2.7    n             1.51     2.6   16.4
eh        1.22     1.3    1.0    ng            1.52     2.7   17.8
ey        1.21     1.2    1.9    em            1.67     3.3   22.8
ae        1.20     1.2    1.0    en            1.61     3.0   20.9
aa        1.28     1.6    1.5    eng           1.63     3.1   22.4
aw        1.27     1.5    0.4    nx            1.21     1.2    3.9
ay        1.26     1.5    0.9    mean          1.53     2.7   17.5
ah        1.27     1.5    1.0
ao        1.36     2.0    0.2
oy        1.31     1.7    0.7    Glides       Ny/Nx  dBgain      E
ow        1.32     1.8    1.5    l             1.37     2.0    3.8
uh        1.32     1.8    2.6    r             1.23     1.3    1.8
uw        1.38     2.1    4.3    w             1.44     2.3    8.0
ux        1.31     1.7    4.6    y             1.29     1.6    6.4
er        1.19     1.1    1.6    hh            1.19     1.1    4.4
ax        1.35     1.9    6.0    hv            1.25     1.4    5.2
ix        1.28     1.6    6.1    el            1.42     2.2    6.8
axr       1.23     1.3    4.7    mean          1.31     1.7    5.2
axh       1.25     1.4   10.2
mean      1.28     1.6    2.7




segmentation  process  to  classify  the  voiced  regions,  for  which  the  spectral  flatness  measure 
was  selected.  The  evaluation  radius  is  adapted  as  a function  of  speech  tonality.  For  non- 
tonal  speech  segments  the  selection  of  the  evaluation  radius  should  have  a minimal  effect 
on  the  loudness  gain,  since  it  approaches  unity  gain.  Most  likely,  the  tonality  measure  is 
discriminating  only  between  speech  and  non-speech  regions,  and  the  majority  of  the  speech 
is  being  processed  by  the  warped  filter  with  bandwidth  expansion  enabled.  We  can  attribute 
the  loudness  gain  in  the  unvoiced  regions  to  the  sensitivity  of  the  spectral  flatness  measure. 

Table  5.5  determines  the  effective  gain  increase  with  the  loudness  approximation  of 
section  2.4.3.  Again,  we  should  expect  the  voiced  regions  to  have  the  greatest  loudness 
increase  due  to  the  resonant  nature  of  the  formants.  We  should  also  expect  the  greatest 
approximation  error  in  those  regions  of  speech  with  the  greatest  masking  effects,  such  as 
the  unvoiced  regions,  since  these  are  the  most  difficult  to  model.  Results  of  Table  5.5  are 
summarized  in  Table  5.8  for  the  phoneme  categories.  We  see  the  vowel  and  glide  regions, 
excluding  the  affricates,  have  the  lowest  approximation  error  and  the  highest  effective  gain. 
Loudness  gain  results  for  linear  bandwidth  expansion,  which  uses  the  true  ISO-532B  to 
calculate  the  gain,  are  more  acceptable  than  those  using  the  loudness  approximation.  The 
motivation  behind  the  approximation  was  to  use  the  approximation  in  real  time  to  equalize 
loudness  gains.  Results  indicate  the  approximation  is  suitable  for  voiced  regions  of  speech 
with  a vowel  error  of  2.7%,  but  may  not  be  suitable  for  unvoiced  regions  of  speech.  The 
next  two  tables  provide  results  for  the  warped  bandwidth  expansion  model. 

Warped 

Tables  5.6  & 5.7  present  the  warped  loudness  increase  given  by  the  true  ISO-532B  and 
the  loudness  approximation,  respectively.  Both  tables  provide  the  resulting  decibel  gain 
using  Eqs(5.13)  and  (5.14).  The  associated  error  of  the  approximation  is  provided  by  the 
true  ISO-532B.  In  both  tables  we  see  the  loudness  increase  is  greatest  for  the  voiced  regions 
of  speech.  In  Table  5.6  the  vowel  loudness  increase  is  2.1dB  with  an  approximation  error 
of  0.1%.  The  loudness  increase  in  Table  5.7  using  the  approximation  is  slightly  higher 
at  2.3dB  and  3.6%  error.  From  these  results  and  from  the  previous  tables  and  phoneme 
result  distributions,  we  should  note  that  the  dB  gain  is  very  sensitive  to  the  approximation 
error.  Only  for  small  approximation  errors  should  we  expect  the  dB  gain  to  be  a true 




Table 5.6: Equal energy phoneme gains for warped expansion α = 0.5 with the true ISO-532B
for the 1,681 sentences of the TIMIT test set with SPL levels between 50 and 80 dB. The
ratio Ny/Nx is the loudness increase of the enhanced phoneme to the original (number of
times louder), dBgain is the gain from Eq(5.13) required to scale up the original to achieve
equal loudness, Ny/Ngx is the loudness increase of the enhanced to the scaled original, and
E = |1 - Ny/Ngx| is the approximation error of dBgain.

Stops   Ny/Nx  dBgain  E (%)    Affricates  Ny/Nx  dBgain  E (%)
b       1.51   2.6     5.2      jh          1.10   0.6     3
d       1.24   1.4     2.9      ch          1.10   0.6     1.3
g       1.35   1.9     5.5      mean        1.10   0.6     2.1
p       1.27   1.5     3.8
t       1.11   0.7     1.4
k       1.17   1.0     2.4      Fricatives  Ny/Nx  dBgain  E (%)
dx      1.36   1.9     3.5      s           1.18   1.1     2.2
q       1.35   1.9     3.9      sh          1.11   0.7     1.0
bcl     1.27   1.5     7.2      z           1.15   0.9     3.7
dcl     1.18   1.1     8.8      zh          1.08   0.5     2.5
gcl     1.24   1.4     8.9      f           1.10   0.7     3.9
pcl     1.12   0.7     6.5      th          1.12   0.7     6.2
tcl     1.11   0.7     7.6      v           1.33   1.8     6.9
kcl     1.12   0.8     6.9      dh          1.35   1.9     5.5
mean    1.24   1.4     5.3      mean        1.18   1.1     4.0

Vowels  Ny/Nx  dBgain  E (%)    Nasals      Ny/Nx  dBgain  E (%)
iy      1.35   1.9     0.2      m           1.47   2.5     0.7
ih      1.37   2.0     0.1      n           1.37   2.0     0.6
eh      1.35   1.9     0.1      ng          1.37   2.0     0.5
ey      1.33   1.8     0.0      em          1.42   2.2     0.7
ae      1.30   1.7     0.1      en          1.44   2.3     0.8
aa      1.42   2.2     0.2      eng         1.49   2.6     0.9
aw      1.37   2.0     0.1      nx          1.27   1.5     0.2
ay      1.36   2.0     0.1      mean        1.40   2.2     0.6
ah      1.38   2.0     0.1
ao      1.48   2.5     0.1
oy      1.44   2.3     0.1      Glides      Ny/Nx  dBgain  E (%)
ow      1.42   2.3     0.1      l           1.45   2.4     0.1
uh      1.43   2.3     0.1      r           1.38   2.1     0.1
uw      1.48   2.5     0.1      w           1.50   2.6     0.2
ux      1.42   2.2     0.2      y           1.39   2.1     0.4
er      1.37   2.0     0.0      hh          1.23   1.3     0.3
ax      1.40   2.2     0.3      hv          1.29   1.6     0.3
ix      1.36   2.0     0.4      el          1.46   2.4     0.2
axr     1.36   2.0     0.2      mean        1.38   2.1     0.2
axh     1.20   1.2     0.6
mean    1.38   2.1     0.1



Table 5.7: Equal energy phoneme gains for warped expansion α = 0.5 with the approximation
of the ISO-532B for the 1,681 sentences of the TIMIT test set with SPL levels between 50
and 80 dB. The ratio Ny/Nx is the loudness increase of the enhanced phoneme to the original
(number of times louder), dBgain is the gain from Eq(5.13) required to scale up the original
to achieve equal loudness, Ny/Ngx is the loudness increase of the enhanced to the scaled
original, and E = |1 - Ny/Ngx| is the approximation error of dBgain.

Stops   Ny/Nx  dBgain  E (%)    Affricates  Ny/Nx  dBgain  E (%)
b       1.54   2.8     5.2      jh          1.10   0.6     3.0
d       1.21   1.2     0.7      ch          1.11   0.7     1.3
g       1.39   2.1     8.9      mean        1.11   0.6     2.1
p       1.30   1.7     5.4
t       1.12   0.7     2.0
k       1.19   1.1     3.9      Fricatives  Ny/Nx  dBgain  E (%)
dx      1.42   2.2     6.6      s           1.21   1.2     4.6
q       1.49   2.6     10.3     sh          1.11   0.7     1.6
bcl     1.79   3.7     24.1     z           1.19   1.1     7.3
dcl     1.48   2.5     20.3     zh          1.10   0.6     4.1
gcl     1.64   3.1     24.1     f           1.13   0.8     4.7
pcl     1.30   1.7     14.0     th          1.17   1.0     8.9
tcl     1.31   1.7     16.6     v           1.59   3.0     17.1
kcl     1.25   1.4     12.5     dh          1.45   2.4     9.0
mean    1.39   2.1     11.0     mean        1.24   1.4     7.2

Vowels  Ny/Nx  dBgain  E (%)    Nasals      Ny/Nx  dBgain  E (%)
iy      1.41   2.2     5.2      m           1.8    3.7     20.7
ih      1.41   2.2     3.4      n           1.69   3.4     19.7
eh      1.39   2.1     1.3      ng          1.73   3.5     21.5
ey      1.38   2.0     2.5      em          1.93   4.2     27.0
ae      1.34   1.9     1.5      en          1.82   3.8     24.4
aa      1.43   2.3     1.2      eng         1.72   3.5     20.4
aw      1.41   2.2     0.7      nx          1.32   1.8     5.1
ay      1.40   2.2     1.7      mean        1.72   3.4     19.8
ah      1.41   2.2     1.3
ao      1.51   2.6     0.1
oy      1.47   2.5     1.1      Glides      Ny/Nx  dBgain  E (%)
ow      1.47   2.5     1.6      l           1.51   2.6     4.0
uh      1.48   2.5     2.7      r           1.44   2.3     3.6
uw      1.56   2.8     5.0      w           1.63   3.1     8.7
ux      1.49   2.6     5.5      y           1.45   2.4     7.0
er      1.43   2.3     3.7      hh          1.27   1.5     6.0
ax      1.50   2.6     7.5      hv          1.34   1.9     6.6
ix      1.44   2.3     7.8      el          1.57   2.9     7.2
axr     1.48   2.5     8.4      mean        1.46   2.4     6.2
axh     1.33   1.8     12.9
mean    1.44   2.3     3.6




representation of the actual equal loudness gain. Once again, the loudness gain for the
ISO-532B approximation overestimates the true gain and, as expected, carries a larger
approximation error.

The results of Tables 5.6 & 5.7 show that warped bandwidth expansion provides a
slightly higher loudness gain in voiced regions. As noted earlier, the reference is the
original loudness, and the comparison does not consider the change in intelligibility, which
cannot be objectively evaluated. Listening tests are provided in Chapter 6 to remove the
dependency of these results on intelligibility. Subjective results will show a negligible
change in intelligibility. The effective dB gain improvement for the warped expansion and
linear expansion is presented for the six phoneme categories in Table 5.8. These results are
in agreement with Table 5.3, which showed that warped bandwidth expansion provides a slightly
greater loudness ratio, Ny/Nx, than linear expansion. The true ISO-532B loudness gain results
of Tables 5.4 & 5.6 show vowel loudness gains of 1.4 dB and 2.1 dB for the linear and warped
expansion, respectively. Table 5.8 provides the dB gain improvement we were seeking in
the previous section to establish the energy gain increase for equivalent loudness. The
warped bandwidth expansion results are satisfying since we hypothesized that loudness will
increase at a greater rate on a critical band scale. The results are also consistent with
Figure 5.4, which showed that the loudness gain ratio was greatest for critical band expansion
as α was swept from 0 to 0.8 in increments of 0.1, where α represents the warping factor.

Table 5.8: Average loudness increase, equivalent dB gain, and approximation error for
phoneme categories using the true ISO-532B for α = 0 and α = 0.5 from Table (5.6) for
TIMIT test sentences with SPL levels between 50 and 80 dB.

            α = 0                               α = 0.5
Category    Ny/Nx  dBgain  |1 - Ny/Ngx| (%)     Ny/Nx  dBgain  |1 - Ny/Ngx| (%)
vowels      1.24   1.4     0.6                  1.38   2.1     0.1
nasals      1.29   1.6     5.0                  1.40   2.2     0.6
glides      1.26   1.5     1.9                  1.38   2.1     0.2
stops       1.16   0.9     4.0                  1.24   1.4     5.3
affricates  1.07   0.4     0.9                  1.10   0.6     2.1
fricatives  1.13   0.8     2.9                  1.18   1.1     4.0



5.2  Speech  Recognition 

An objective measure of speech quality is one that captures all the detail of what
humans would ascribe to the subjective quality of speech. There are no direct objective
measures to evaluate speech quality [76]. Speech quality is an ambiguous term, and one
that is too elusive to describe as a single quantity. However, speech quality is a precursor
to speech intelligibility, for which certain objective measures are available [114, 100, 144].
In Chapter 6 we will see that certain intelligibility measures can be used to evaluate the
transmission characteristics of communication channels. However, none of these ratings
can evaluate the effect of speech processing techniques on quality or intelligibility. We may,
though, for the time being, consider human intelligibility synonymous with recognition by
machine. Speech recognition systems attempt to make a best match between presented
repetitions of speech patterns, so speech recognition performance may provide a means
of quantifying machine recognition quality. Speech recognition systems have been used for
the objective quality analysis of speech in voice over internet systems [61]. In this section
we provide recognition rate results for speech processed by the warped filter of Chapter
4, to demonstrate that the warped filter does not degrade recognition rates for the
standard cepstral speech recognition front end of the DTW and HMM systems.

5.2.1  Spectral  Distortion  Measures 

Most speech recognition systems use some form of the short-time auditory speech
spectrum as a front-end feature extractor: filterbank, cepstral, mel-cepstral, PLP, and RASTA
[56, 57]. The majority utilize some form of the cepstral vector derived from a filterbank
designed according to some model of the human auditory system [68, 37]. The speech
pattern is a composition of the feature vectors used in a pattern classification scheme. The
classifier provides a measure of similarity between speech templates, and makes a decision
as to what speech token was most likely presented. The similarity measure is usually based
on a spectral distortion distance. Many spectral distortion measures have been proposed
which are closely related to the human perception of sound [39, 35, 38]. Interestingly, the
pole enhancement technique is also known as a spectral distortion weighting function [84].




A variation  of  the  pole  displacement  model  has  been  used  to  define  a frequency  weighted 
Itakura  distortion  measure  [125]. 


dwi 


1 

|A(re^u;)p  \A{e^u))\^ 


(5.15) 


The Itakura measure is a gain independent distortion measure also known as the log
likelihood measure d_LLR [100]. It is an asymmetric distortion measure, and creates distortion
characteristics similar to the masking properties of the auditory system [114]. Studies in
auditory masking have shown an asymmetric behavior of masking between tones and noise
[40]. Noise is easier to hear in a tone than it is to hear a tone in noise. The Itakura measure
provides this asymmetric quality and has been suggested as a subjective distance measure
for speech [100]. Both unity gain all-pole models are compared solely on the basis of their
spectral shape. The Itakura distortion is described by Eq(5.16), where the distortion is
between LPC coefficients a and b of speech segments s_a(n) and s_b(n), a^T R_a a is the
prediction error, and R_a is the autocorrelation of s_a(n).

    d_{LLR} = \ln \frac{b^T R_a b}{a^T R_a a}    (5.16)
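
As a hedged illustration of the log likelihood ratio of Eq(5.16), the following minimal Python
sketch compares two speech frames. It assumes the autocorrelation method with a
Levinson-Durbin recursion for the LPC analysis and the sign convention
A(z) = 1 + a_1 z^-1 + ... + a_p z^-p; the function names and the default order are
illustrative choices, not taken from this work.

```python
import numpy as np

def autocorr(x, order):
    """Biased autocorrelation r[0..order] of one frame."""
    x = np.asarray(x, dtype=float)
    return np.array([np.dot(x[: len(x) - k], x[k:]) for k in range(order + 1)])

def levinson(r):
    """Levinson-Durbin recursion; returns [1, a_1, ..., a_p] (assumed convention)."""
    p = len(r) - 1
    a = np.zeros(p + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, p + 1):
        # reflection coefficient from the current prediction residual
        k = -np.dot(a[:i], r[i:0:-1]) / err
        a[: i + 1] = a[: i + 1] + k * a[: i + 1][::-1]
        err *= 1.0 - k * k
    return a

def toeplitz(r):
    """Symmetric Toeplitz autocorrelation matrix built from r[0..p]."""
    n = np.arange(len(r))
    return r[np.abs(n[:, None] - n[None, :])]

def itakura_llr(frame_a, frame_b, order=10):
    """ln( b^T R_a b / a^T R_a a ), with R_a taken from frame_a."""
    r_a = autocorr(frame_a, order)
    R_a = toeplitz(r_a)
    a = levinson(r_a)
    b = levinson(autocorr(frame_b, order))
    return float(np.log((b @ R_a @ b) / (a @ R_a @ a)))
```

Because a minimizes x^T R_a x over all coefficient vectors with a leading 1, the ratio is
never below 1 and the distance is non-negative. Swapping the two frames generally gives a
different value, which is the asymmetry described above.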


Distance measures in speech recognition are usually chosen to be sensitive to spectral
variation, and the Itakura-Saito measure is one of the more successful distortion measures
well correlated to subjectively perceived distortions [114, 100]. Table 5.9 provides the
correlation index of various objective quality measures of spectral distortion with respect to
the subjective acceptance scoring of the Diagnostic Acceptability Measure (DAM) [21]. The
correlation index ρ describes how well the objective distortion rating predicts the measured
subjective quality. A few of these are seen to be moderately well correlated with subjective
quality. Composite measures which mix results from different sets of measures can achieve
higher correlations, up to 0.86.

Table 5.9: Comparison of the average correlation coefficient ρ between objective and
subjective speech quality [21].

Measure                     ρ
LP coefficients             0.06
Itakura                     0.59
Linear Spectral Distance    0.38
Log Area Ratio              0.62
Reflection Coefficients     0.46
Log Spectral                0.60



5.2.2  A Measure  of  Loudness  Distortion 

Interestingly, the right hand side of Eq(5.13) represents a form of loudness distortion.
The result, however, is used to determine a gain factor, and not to measure the similarity
of speech patterns for speech recognition. Measures of loudness distortion such as the Bark
Spectral Distortion (BSD) have been proposed for evaluating objective speech quality, and
show moderate correlation to Mean Opinion Score (MOS) test results [144]. Perceptually
based distortion measures have also been proposed as objective measures of speech quality
[143]. The M-Itakura (modified) distortion measure is a measure of loudness distortion
based on the human auditory system [46]. It uses the PLP prediction set as the predictor
coefficients in the Itakura distortion distance. The PLP method incorporates the three
principles of psychophysics previously discussed in the PLP coefficients.
The M-Itakura distortion measure is defined as

Dmi  = {Sa:,Sy)  = (5.17) 

where the all-pole prediction coefficients a are derived from the PLP power spectrum. The
PLP method, however, requires an FFT-IFFT combination in addition to the LPC analysis
for critical band scaling. Moreover, the M-Itakura cannot be directly applied to Eq(5.9).
The Itakura distortion of Eq(5.17) represents a power spectrum ratio integral


Di 


r 

J-.  |5(u;)P 


dw 


(5.18) 


and Eq(5.9) is effectively a ratio of integrals, where Y(ω) is a filtered version of the power
spectrum |S(ω)|^2, and the prime denotes bandwidth expansion,

    D = \frac{\int_{-\pi}^{\pi} |Y'(\omega)|^{k} \, d\omega}{\int_{-\pi}^{\pi} |Y(\omega)|^{k} \, d\omega}    (5.19)


Distortion  measures  are  used  to  evaluate  the  similarity  of  speech  patterns  for  speech 
distortion  and  have  been  generally  confined  to  speech  recognition.  The  loudness  gain  model 
of  Eq(5.9)  introduces  a new  type  of  distortion  which  evaluates  the  subjective  percept  of 




loudness  gain.  It  describes  the  equivalent  loudness  gain  for  equal  energy  spectra.  Total 
loudness  has  been  shown  to  be  a summation  of  specific  loudness,  and  an  overall  change  in 
loudness  is  the  difference  of  the  specific  loudness  spectra.  In  Chapter  2,  we  noted  that  the 
area  between  two  loudness  spectra  represents  the  specific  change  in  loudness.  Eq(5.19)  can 
be  equivalently  represented  as  a change  in  specific  loudness,  rather  than  total  loudness. 


Thus  we  can  rewrite  the  loudness  gain  function  of  Eq(5.12)  as 

^2k  — 1 ^ -/-TT  ^ 

A distortion measure which evaluates this distance, or distortion, provides the gain scale
to achieve equivalent loudness. The log spectral distance has been used in speech recognition
as a subjectively meaningful distortion measure since perceived loudness is approximately
logarithmic [114]. The loudness difference in the numerator of Eq(5.20), however, is not
logarithmic and contains an additional filtering operation. The nonlinear compression
defined by k is similar, though, to the behavior of a logarithm. It would be very rewarding to
expand the loudness gain function of Eq(5.9) with a spectral distortion measure. Extensive
studies have been conducted on the family of speech recognition distortion measures [100],
but none have been applied to the calculation of equivalent loudness.

For continuity, we will complete the expansion of Eq(5.20) with the log spectral
distortion as an approximation to the nonlinear compression of the human auditory system.
In addition, the cepstral distortion is investigated for its similarity to the log spectral
distortion. The truncated cepstral distance is an efficient method for estimating the rms log
spectral distance, and can be easily derived from the linear prediction coefficient set [110].
Additionally, we are guaranteed a warped cepstral model can be derived from a warped
linear prediction set. For this reason we will pursue the investigation of the cepstral model
for our loudness gain function. Parseval’s theorem relates the cepstral distance to the rms
log spectral distance as,

    \int_{-\pi}^{\pi} |\log P(\omega) - \log P'(\omega)|^2 \, \frac{d\omega}{2\pi} = \sum_{n=-\infty}^{\infty} (c_n - c'_n)^2    (5.20)

where P(ω) represents the power spectral density |S(ω)|^2. The cepstrum of a signal is
defined as the inverse Fourier transform of the log of the signal spectrum, and is related to
the power spectrum by

    \log P(\omega) = \sum_{n=-\infty}^{\infty} c_n e^{-jn\omega}
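
The Parseval relation between the cepstral distance and the rms log spectral distance can be
checked numerically. The sketch below is a minimal illustration, assuming power spectra
sampled on a uniform DFT grid (so the integral becomes a mean over frequency bins) and
spectra that are symmetric in frequency, as for real signals; the function names and the
`n_terms` truncation parameter are hypothetical.

```python
import numpy as np

def real_cepstrum(power_spectrum):
    """Inverse DFT of log P, with P sampled on a full grid over [0, 2*pi)."""
    return np.fft.ifft(np.log(power_spectrum)).real

def log_spectral_distance_sq(P, P_prime):
    """Mean squared log spectral difference (discrete form of the integral)."""
    return float(np.mean((np.log(P) - np.log(P_prime)) ** 2))

def cepstral_distance_sq(P, P_prime, n_terms=None):
    """Squared cepstral distance; truncated to +/- n_terms if requested."""
    d = real_cepstrum(P) - real_cepstrum(P_prime)
    if n_terms is None:
        # all DFT bins: positive and negative quefrencies together
        return float(np.sum(d ** 2))
    # even sequence: count each retained coefficient n >= 1 twice
    return float(d[0] ** 2 + 2.0 * np.sum(d[1 : n_terms + 1] ** 2))
```

For smooth all-pole spectra the cepstral coefficients decay quickly, so a short truncation
already reproduces the full distance closely, which is why the truncated cepstral distance is
an efficient estimate.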


When the speech is modelled by a stable minimum phase all-pole spectrum P(ω), the
Fourier series representation of log P(ω) can be represented by a Taylor series leading to
the Laurent expansion [114]. This representation can be differentiated to yield a recursive
formula for obtaining the cepstral coefficients from the autoregressive coefficients. The
recursion is given by Eq(5.21), where c_0 = log σ^2 and the real cepstrum is the even sequence
c_{-n} = c_n.

    c_n = -a_n - \frac{1}{n} \sum_{k=1}^{n-1} (n-k) \, c_{n-k} \, a_k    for n > 0    (5.21)

The pole enhancement technique can be incorporated in the cepstral recursion to yield
a pole displaced cepstrum [84]. The technique has been used to model noisy speech for
speech recognition applications in adverse conditions [125]. Recall, the pole displacement
includes an off-axis radius term to scale the LPC coefficients by a power series of r,

    A(re^{j\omega}) = \sum_{i=0}^{p} (a_i r^{-i}) \, e^{-j\omega i}

For r > 1, 1/A(rz) is still analytic within the unit circle. The cepstral recursion can be
expressed as

    -n c_n - n a_n = \sum_{k=1}^{n-1} (n-k) \, c_{n-k} \, a_k

multiplying both sides of the equation by r^{-n} and rearranging terms,

    -n c_n r^{-n} - n a_n r^{-n} = \sum_{k=1}^{n-1} (n-k) \, (c_{n-k} r^{-(n-k)}) \, (a_k r^{-k})



a relationship between the original LPC cepstrum and the pole displaced LPC cepstrum
can be established [84],

    c'_n = c_n r^{-n}    (5.22)

This is the first step towards reducing our loudness gain function to an all-pole derivative
model. The second step is to incorporate a critical band scale. The Bark or Mel frequency
scale is subjectively more meaningful than a linear frequency scale. Bark and Mel scales
reflect comparable basilar membrane displacements in the cochlea and are associated with
a constant physical spacing [149]. Cepstral-based frequency warping techniques have been
proposed for spectral distortion measures in speech recognition [83]. The frequency axis
is first warped to a Bark or Mel scale, and then the distortion is measured to make a
more meaningful distance calculation. Frequency warping can also be carried out with the
distortion computation of the cepstral coefficients [100]. Parseval’s theorem showed that
the rms log spectral distance between two spectra S(ω) and S'(ω) is equivalent to the L2
norm cepstral distance in Eq(5.20). For a frequency warped representation S(ω) is replaced
by S(λ(ω)), where the frequency warping function is λ over the integration of the Bark scale
B [114]. Inclusion of the frequency warping function in the derivation of Eq(5.20) yields a
warped cepstral distortion given by [100],

    d^2_{wcep} = \sum_{l=-\infty}^{\infty} \sum_{m=-\infty}^{\infty} (c_l - c'_l) \, w_{lm} \, (c_m - c'_m)

where w_{lm} is the warping function and can be evaluated by numerical integration along the
Bark scale. Figure 5.6 shows the real part of the warping function for values of k [114]. The
warped cepstral distance can be written in matrix notation as

    d^2_{wcep} = (c - c')^T W (c - c')    (5.23)


[Plot panels for k = 1, 3, 5; horizontal axis: Bark scale, 1 to 16.2]

Figure 5.6: Real part of the warping function as a function of b, the Bark scale, for values of k.


Interestingly, it is not too difficult to show that the pole displacement model can be
included in the warped cepstral distance. Inserting Eq(5.22) into Eq(5.23) we can propose
an extension to the warped cepstral distortion measure as

    d^2_{wcep} = (c - c r^{-i})^T W (c - c r^{-i})    (5.24)

where c r^{-i} is a power series scaling of the length i cepstral sequence c. Assuming that
the outer to middle ear filter is precluded in the cepstral analysis, and the logarithm is an
appropriate substitute for the compressive nonlinearity, a reduction of the loudness gain



function can be presented as a ratio of the warped cepstral distortion to the reference
loudness.

    g^{2k} = \frac{\sqrt{(c - c r^{-i})^T W (c - c r^{-i})}}{\sum_{k=0}^{N-1} |H(z)|^{2k} \big|_{z = e^{j 2\pi k / N}}}    (5.25)


Eq(5.25)  suggests  that  the  equivalent  loudness  gain  in  vowels  for  a specific  bandwidth 
expansion  factor,  r,  can  be  approximated  through  knowledge  of  the  linear  prediction  analysis 
alone.  The  warping  matrix  is  fixed  [114],  and  the  cepstral  coefficients  along  with  the  warped 
all-pole  model  are  derived  by  the  cepstral  LPC  recursion  of  Eq(4.17),  [110].  Eq(5.25)  defines 
a novel  spectral  distortion  measure  for  determining  equivalent  loudness  by  means  of  an  all- 
pole derived  gain  function.  The  gain  function  of  Eq(5.25)  was  presented  for  mathematical 
illustration,  and  is  not  continued  in  the  remainder  of  this  chapter.  It  will  be  considered  for 
future  work. 
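
The pole displaced cepstrum of Eq(5.22) and the quadratic form of Eq(5.24) can be sketched
numerically as follows. This is a minimal illustration under stated assumptions: the
convention A(z) = 1 + a_1 z^-1 + ... + a_p z^-p, unit model gain (so c_0 = 0), and a
caller-supplied weighting matrix W. The Bark warping matrix itself is not reproduced here;
an identity matrix is used only as a placeholder, which reduces the measure to an ordinary
cepstral distance. All function names are hypothetical.

```python
import numpy as np

def lpc_to_cepstrum(a, n_ceps):
    """Recursion c_n = -a_n - (1/n) sum_k (n-k) c_{n-k} a_k, with c_0 = 0."""
    a = np.asarray(a, dtype=float)
    p = len(a) - 1
    c = np.zeros(n_ceps + 1)
    for n in range(1, n_ceps + 1):
        acc = sum((n - k) * c[n - k] * a[k] for k in range(1, min(n - 1, p) + 1))
        c[n] = -(a[n] if n <= p else 0.0) - acc / n
    return c

def displace_poles(a, r):
    """Scale LPC coefficients by a power series of r: a_i -> a_i r^-i."""
    a = np.asarray(a, dtype=float)
    return a * r ** (-np.arange(len(a), dtype=float))

def displaced_cepstrum(c, r):
    """Pole displaced cepstrum: c_n -> c_n r^-n."""
    c = np.asarray(c, dtype=float)
    return c * r ** (-np.arange(len(c), dtype=float))

def pole_displaced_distance_sq(c, r, W):
    """(c - c r^-i)^T W (c - c r^-i) for a given weighting matrix W."""
    d = np.asarray(c, dtype=float) - displaced_cepstrum(c, r)
    return float(d @ W @ d)
```

Applying the recursion to the scaled coefficients gives the same result as scaling the
original cepstrum, so the distance can be evaluated from a single LPC analysis and the
expansion factor r.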


5.3  Recognition  Results 

Table 5.9 provided the correlation index of certain objective quality measures with
composite acceptability of the Diagnostic Acceptability Measure [21]. This suggests that some
spectral distortion measures are well correlated to subjectively perceived distortion
distances [144, 143]. A correlation of 1.0 would effectively allow us to substitute the objective
measure for the subjective quality speech assessment ratings. Such a measure would
completely capture the human perception of speech quality, and would alleviate the need for
exhaustive listening and evaluation tests. However, as we see in Table 5.9, such objective
measures are still limited in their ability to completely predict subjective speech quality.
The best measures of speech performance are provided by human listeners, and in Chapter
6 we conduct speech assessment listening tests to rate the subjective quality of speech
processed by the warped filter. We can, though, complement the subjective ratings with
objective measures of speech performance since a good correlation exists. The log spectral
distortion is one such measure: it is a commonly used distance measure which provides good
performance in speech recognition applications, and it showed a good correlation (0.6) to
subjective quality ratings in Table 5.9. We also noted in Eq(5.20) that the L2 cepstral
distortion is equivalent to the log spectral distortion, and a truncated cepstral distance is
an excellent approximation. We have also seen that the cepstral set is a robust and commonly
used front end feature extractor for speech recognition applications.

In this section we present comparative results of speech recognition performance for
speech processed by the warped filter and unprocessed speech using cepstral analysis. The
Dynamic Time Warping (DTW) solution and Hidden Markov Model (HMM) systems are
used to determine if the speech enhancement technique of the warped filter degrades
recognition performance. Using speech recognition performance to evaluate quality
is a heuristic measure, but one which should be documented. Many approaches to speech
enhancement have been favorably used to improve speech recognition performance in
adverse conditions [122, 58, 81, 27]. Similarly, speech recognition systems have been used for
the objective evaluation of certain speech enhancement algorithms [140, 82]. Our intent in
evaluating the recognition performance is to show that the warped bandwidth expansion
does not degrade the recognition rate. This does not mean, however, that a favorable
change in recognition rate implies good speech enhancement. Some techniques which improve
recognition performance, such as cepstral mean subtraction and cepstral mean normalization,
are not suitable for speech enhancement [71, 79, 139]. Techniques such as spectral
subtraction, adaptive filtering, auditory modelling, and bandwidth expansion, though,
have been used to improve recognition rates [1, 129, 115].

The Speech Recognition Algorithm Testing Environment (SR-8) was developed to
perform the function of a complete speech recognition testing and documentation environment
[6]. It is a collection of Matlab files whose foundation relies heavily on parsing routines
and scripting functions. Currently it supports DTW and HMM recognition engines. We
developed it to evaluate the performance of speech recognition algorithms prior to assembly
integration on the embedded Motorola DSP56600 platform. The DTW and HMM results
are presented to note if the speech enhancement technique of the warped filter alters
recognition performance. No other speech recognition tests are performed, though the system
is well capable of doing so if necessary. The SR-8 evaluation environment is complete for
the quantitative documentation of recognition performance. Figure 5.7 shows the SR-8
Matlab user interface. Either of the two recognition systems can be selected. The DTW
approach only allows for first order statistics and is thus insensitive to feature variation.
The HMMs allow for second order statistics, namely the variance, which is unavailable in
template based pattern matching. The discrete HMMs use the K-means VQ method with
multi-labeling, and the continuous density HMMs support single Gaussian density functions.
The multilabel tag in Figure 5.7 provides the number of closest matches to the centroid
VQ codeword, which serve as additional exemplars for discrete HMM training. The skip
tag denotes the maximum number of state transitions the HMM can jump. The HMMs do
not incorporate state duration modeling or state dynamics. A complete description of the
HMM systems is provided in Appendix C.


[Screenshot: recognizer selection (HMM or DTW); continuous or discrete models; diagonal or
full covariance; code book 32, states 10, skips 0, multilabels 5; selection lists for
speakers, words, and conditions]

Figure 5.7: Speech recognition test GUI.


The SR-8 currently supports 3 Motorola datasets averaging about 10,000 words each.
Each dataset is a collection of uttered vocabulary words spoken in selected environments
by different speakers. The datasets are the property of the Chicago Motorola Corporate
Research Laboratory and are used with permission. The Startac dataset, used for testing,
contains 12,344 word utterances from 15 different speakers in 6 different noise conditions.
Each speaker utters 20 different words with 1-7 repetitions per word. The
Chicago set contains 7770 word utterances from 25 different speakers in 6 different noise
conditions. Each speaker utters 19 words with 1-4 repetitions per word. The
Hands-free dataset contains 16,191 word utterances from 28 different speakers in 6 different
noise conditions. Each speaker utters 26 words with 1-6 repetitions per word.
The different environments provide a broad range of the conditions the recognition system
may expect to encounter. A data naming convention provides information as
to which word was spoken, in what environment it was said, and so on, to facilitate data
selection. For illustration, Figure 5.8 presents cepstral and filterbank recognition results
for the Startac, Chicago, and Hands-free datasets using the SR-8. The SR-8 was used to
evaluate the speech recognition performance of our prototype cepstral front end in the
Motorola i1000 cell phone [5]. Speakers and conditions are grouped together for brevity. The
percentage score shown for each condition is the average recognition percentage over all the
speakers and all the words in that selected condition.


Figure  5.8:  SR-8  documentation  results  of  recognition  performance. 


5.3.1  DTW  Results 

The  Dynamic  Time  Warping  (DTW)  solution  is  a template  based  matching  routine  that 
selects  the  reference  template  with  the  smallest  spectral  distortion  distance.  It  attempts 
to  remove  the  temporal  differences  due  to  speaker  rates  and  variations,  and  to  align  the 
templates as smoothly as possible for similarity decisions. The DTW is an accumulated
distortion measure that evaluates the pattern dissimilarity between the two feature templates
under  consideration.  The  DTW  implemented  uses  a tree  search  routine  where  accumulated 
distortion  measures  are  projected  forward.  A minimum  cost  path  is  established  in  the  tree 




search  routine  that  defines  the  measured  similarity  of  the  two  templates.  This  determines 
an  optimal  warping  path  in  the  search  tree  to  provide  the  best  match.  DSP  integration 
results  for  the  speech  recognition  performance  of  the  DTW  filterbank  in  the  Motorola  1600 
cell  phone  were  presented  in  [7]. 

Table 5.10 presents the DTW speech recognition results for one speaker of the Startac
database for warped and non-warped speech templates. Both results use 12th-order cepstral
analysis without delta cepstral features, and use only two examples of each word in
condition 1 for training. The DTW uses a one-step forward, backward, and diagonal path
constraint with unitary weighting. The DTW is known to perform extremely well relative to
its training requirements. The NO WARPING columns of Table 5.10 show the results for the
unprocessed data, with an overall recognition rating of 97.2% for all conditions. For both
sets of columns, each column index represents a recording condition. This table also
provides the SNR value of each condition. The WARPING columns of Table 5.10 show the results
for the speech processed with the warped filter, with an overall recognition rating of 95.0%
for all conditions. For speaker 901 the recognition ratings differ by about 2%, which is
small enough to state that the speech enhancement method has a negligible effect on
performance, and thus does not degrade the machine quality of speech.
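
As a worked check of the overall ratings, the short sketch below pools the "sum" and "total" rows of Table 5.10 over the six conditions. Pooling correct words over total words is our assumption about how the final percentage was formed; it reproduces the 97.2 and 95.0 figures quoted above. The variable names are illustrative.

```python
def overall_rate(sums, totals):
    """Pooled recognition rate in percent: all correct words over all test words."""
    return 100.0 * sum(sums) / sum(totals)


# "sum" and "total" rows of Table 5.10, conditions 1 to 6
totals = [101, 141, 118, 118, 124, 142]
no_warping_sums = [101, 139, 118, 109, 121, 135]
warping_sums = [99, 139, 115, 104, 121, 129]

print(round(overall_rate(no_warping_sums, totals), 1))
print(round(overall_rate(warping_sums, totals), 1))
```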

5.3.2  HMM  Results 

Table 5.11 presents the discrete-model HMM speech recognition results for speaker 901 of
the Startac database for warped and non-warped speech templates. Both results use 12th-order
cepstral analysis without delta cepstral features, with 6-state discrete HMMs, 5
multilabels, and a VQ codebook size of 64 words, trained with the K-means binary split
algorithm. The HMMs are trained from only 2 repetitions of each word in each condition.
Thus they use 12 examples of each word for training, compared with only 2 for the DTW.
Better recognition rates are attained when more conditions and more repetitions are
available in each condition for training the HMMs. Table 5.11 presents an overall average
recognition rate of 80.5% for the unprocessed speech, and an overall average recognition
rate of 79.5% for the discrete model from speech processed with the warped


Table 5.10: DTW results for original and warped speech templates: Number of vocabulary
words correctly recognized for speaker 901 in Motorola stars database. 20 words vs 6
enumerated conditions; Train in: cond 1 rep 1 2, Test in: cond 1 2 3 4 5 6 rep all

                        NO WARPING                         WARPING
condition         1    2    3    4    5    6        1    2    3    4    5    6
answer            5    7    6    5    7    7        5    7    6    4    7    7
areacode          5    7    6    6    6    6        5    7    6    6    6    6
call              5    8    6    6    7    6        5    8    6    6    7    6
cancel            5    7    6    7    6    6        5    7    6    7    6    6
castle            5    7    6    6    7    7        4    7    6    5    7    7
chargecard        5    7    6    6    5    8        5    7    6    6    5    6
clear             5    7    6    5    6    6        5    7    6    6    6    6
creditcard        6    7    6    6    6    7        6    7    6    6    6    7
dialoperator      5    7    6    6    6    6        5    7    6    5    6    6
eighthundred      5    7    6    5    6    8        5    7    6    6    6    7
emergency         5    7    6    6    6    5        5    7    6    5    6    5
end               5    7    5    4    6    6        5    6    5    5    6    6
friend            5    7    6    3    6    6        5    7    6    4    6    6
home              5    6    6    6    6    6        4    7    5    3    6    6
information       5    7    6    6    6   12        5    7    6    5    6   12
office            5    7    6    4    6    6        5    7    4    3    6    3
program           5    7    5    4    6    6        5    7    5    4    6    6
secretary         5    7    6    6    5    8        5    7    6    6    5    8
security          5    6    6    6    6    7        5    6    6    6    6    7
store             5    7    6    6    6    6        5    7    6    6    6    6
sum:            101  139  118  109  121  135       99  139  115  104  121  129
total:          101  141  118  118  124  142      101  141  118  118  124  142
% score:        100   99  100   92   98   95       98   99   97   88   98   91
% final:                                  97                                 95

filter. Table 5.12 presents the results for continuous 10-state HMMs using single-density
Gaussians with diagonal covariance matrices. Cepstral features are assumed statistically
independent, so full covariance matrices are reduced to diagonal matrices. Results show a
5% difference in recognition performance between the original speech and speech processed
by the warped filter for the discrete HMM, and 1% for the continuous HMM.
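
The practical effect of the independence assumption can be shown with a minimal sketch: with a diagonal covariance matrix, the Gaussian log-likelihood of a cepstral frame reduces to a sum of independent one-dimensional terms, one per coefficient. The function name and arguments are hypothetical and are not taken from the recognizer used in this work.

```python
import math


def diag_gaussian_loglik(x, mean, var):
    """Log-likelihood of vector x under a Gaussian with diagonal covariance.

    x, mean, var are equal-length sequences; var holds the per-dimension
    variances (the diagonal of the covariance matrix).
    """
    loglik = 0.0
    for xi, mi, vi in zip(x, mean, var):
        # each dimension contributes an independent 1-D Gaussian term
        loglik += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return loglik
```

Only one variance per cepstral coefficient has to be estimated per state, instead of a full covariance matrix, which matters when few training repetitions are available.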


Table 5.11: HMM Discrete results for vocabulary words correctly recognized for speaker
901 in Motorola stars database. 20 words vs 6 conditions for original and warped speech
templates, Train in: cond 1 2 3 4 5 6 rep 1 2, Test in: cond 1 2 3 4 5 6 rep all

                        NO WARPING                         WARPING
condition         1    2    3    4    5    6        1    2    3    4    5    6
answer            3    6    6    4    5    4        1    3    2    3    5    4
areacode          7    7    6    6    6    6        6    7    4    4    6    6
call              7    7    6    7    6    5        6    7    6    6    6    5
cancel            5    5    3    4    6    7        6    5    2    4    6    6
castle            1    6    6    6    6    7        7    4    6    5    6    8
chargecard        3    6    4    4    2    4        4    2    6    6    5    4
clear             7    8    6    6    7    6        6    7    6    6    7    6
creditcard        5    2    5    5    2    6        5    2    5    5    1    6
dialoperator      5    7    6    6    6    5        5    7    6    4    6    5
eighthundred      6    6    4    5    6    8        6    5    4    5    6    8
emergency         7    6    5    6    5    9        7    4    5    6    3    9
end               7    4    5    5    5    7        7    3    5    5    5    7
friend            5    6    6    4    6    6        6    4    6    4    6    6
home              7    6    5    4    4    6        5    4    1    5    4    5
information       7    5    3    5    4    8        6    4    5    5    4    8
office            1    4    1    2    3    1        1    0    0    0    2    1
program           5    7    5    3    5    6        4    5    5    4    5    6
secretary         0    0    0    0    0    1        0    0    0    2    0    1
security          7    7    6    6    6    7        7    7    6    6    6    7
store             7    7    6    4    6    6        7    7    6    4    6    6
sum:            102  112   94   92   96  115      102   87   86   89   95  114
total:          141  141  118  118  124  142      141  141  118  118  124  142
% score:         72   79   80   78   77   81       72   62   73   75   77   80
% final:                                  78                                 73

Table 5.12: HMM Continuous results for vocabulary words correctly recognized for speaker
901 in Motorola stars database. 20 words vs 6 conditions for original and warped speech
templates, Train in: cond 1 2 3 4 5 6 rep 1 2

                        NO WARPING                         WARPING
condition         1    2    3    4    5    6        1    2    3    4    5    6
answer            7    6    6    4    6    6        7    5    6    4    6    6
areacode          6    7    5    6    6    6        6    7    5    6    6    5
call              7    7    6    7    6    6        6    6    6    7    6    6
cancel            6    5    4    5    7    6        6    5    4    5    7    6
castle            1    6    6    6    6    8        7    5    6    5    6    8
chargecard        4    7    6    6    6    6        7    5    6    6    6    6
clear             7    8    6    6    7    6        6    7    6    6    7    6
creditcard        5    3    5    5    3    7        5    3    5    5    2    7
dialoperator      6    7    6    6    6    5        6    7    6    5    6    5
eighthundred      7    7    4    5    6    8        7    5    4    5    6    8
emergency         7    6    5    5    5    9        7    4    5    6    3    9
end               5    1    5    4    4    4        7    3    5    5    4    3
friend            5    6    6    5    6    6        6    4    6    5    6    6
home              7    7    5    5    5    5        5    4    4    6    4    6
information       6    5    4    5    5    8        6    4    5    5    5    9
office            1    2    1    0    3    1        1    1    0    0    2    1
program           6    7    5    3    5    6        4    5    5    4    5    6
secretary         0    0    3    1    1    6        0    0    4    2    0    6
security          3    2    6    6    6    7        5    4    6    6    6    7
store             7    7    6    5    6    6        7    7    6    6    6    6
sum:            103  106  100   95  105  122      111   91  100   99   99  122
total:          141  141  118  118  124  142      141  141  118  118  124  142
% score:         73   75   85   81   85   86       79   65   85   84   80   86
% final:                                  80                                 79


CHAPTER  6 

SUBJECTIVE  EVALUATIONS 


Standard listening tests provide reliable measures for defining speech intelligibility and
acceptability common to various testing institutions [23]. The final evaluation of a speech
processing system usually requires some form of measuring intelligibility or acceptability
by human listeners [21]. There are two general categories of speech evaluation:
intelligibility tests and acceptability tests. In this chapter, listening tests are
conducted to evaluate the subjective increase in loudness and to quantify the ISO-532B
analytic results. The listening tests are conducted to substantiate the loudness gains and
correlate the analytic gains to perceptual loudness gains. Three listening tests are
performed: Intelligibility, Loudness, and Acceptability. The intelligibility tests evaluate
the discernibility of speech at 0 dB SNR. The loudness tests evaluate the effective loudness
gain through a series of scaling and comparison procedures. A sensitivity screening test is
also included, which informally evaluates the listener's hearing acuity. The acceptability
tests provide the overall impression of speech quality through score ratings. They also
include a final comparison of the loudness on full-length sentences. The listening
experiments have been designed to determine the subjective loudness gain of an algorithm
and modestly evaluate its effect on intelligibility. Results are presented to elucidate the
effects of the warped bandwidth filter on speech intelligibility, quality, and loudness.

6.1  Measures  of  Speech  Intelligibility 

Speech intelligibility refers to the accuracy with which speech can be meaningfully
interpreted and understood. Usually speech intelligibility is a reference to the clarity,
articulation, and pronunciation of spoken words. The first statistical measures of speech
intelligibility were proposed around the beginning of the twentieth century for telephone
and electronic communication systems. These initial measurements of intelligibility relied
on articulatory testing [32]. In these tests confusable word pairs, nonsense syllables, and
consonant-vowel combinations were presented for intelligibility evaluation purposes. The
speech corpus and methodology of these tests are specified by ANSI and are still in use:
Diagnostic Rhyme Test (DRT), Modified Rhyme Test (MRT), and Phonetically Balanced
word lists (PB). In these experiments the measure of intelligibility is associated with a
quality of articulation. In this light, intelligibility can be considered a quality
assessment of speech by human listeners.

In a similar sense, the intelligibility measures can be regarded as a description of the
transmission effects incurred by the speech between the speaker and the listener, and not
necessarily an intrinsic quality of the speech itself. The transmission medium through
which the speech propagates could also determine the quality of the speech. The first such
experiments were conducted to determine the bandwidth required for the telephone system
[32]. In this reasoning, speech intelligibility refers to the expected quality of speech
passing through a transmission system whose characteristics determine the level of
intelligibility. This is the premise of articulation theory: that the intelligibility of
every listening situation, described by background noise, attenuation, frequency
distortion, speaking level, and hearing acuity of the listener, can be described by an
intermediate variable, the articulation index, which is related to a speech score [99].

The articulation index (AI) proposed by Fletcher was created to measure one aspect
of speech quality: the intelligibility [21]. Fletcher and Galt's method for the AI is a
tedious and complex graphical procedure which was revised for simplicity, and is now the
ANSI S3.5-1997 standard [32]. A Matlab beta version of this standard was made public by
[99]. The original motivation of the AI was to provide a probability measure of how well
communication systems allow listeners to correctly identify speech sounds. The AI concept
provides a basic transformation of the probability of phone correctness into a measure
additive over frequency. The AI assumes individual articulation bands contribute
independently to articulation, and that collectively these bands describe the overall
intelligibility. The articulation within a band represents that portion of the signal which
is audible. French and Steinberg [114] found that these bands contribute about equally to
the intelligibility of speech. This filterbank approach suggests a subband analysis process
in which partial decisions are developed independently and then combined at the level of
phonetic categorization [41]. Like specific loudness, these bands are normally associated
with the critical band spectrum or 1/3 octave band spectrum. The intelligibility
predictions are made by summing up the individual speech band contributions. The critical
band type scale reflects the fact that certain bands make larger contributions to speech
intelligibility than other bands. Experimental results have shown that intelligibility
increases with signal level at a greater rate in higher frequency bands than in lower bands
[99]. Studies also reveal that the upper formants are the primary cues for intelligibility
[18].
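
The additive band structure described above can be sketched as follows. This is a simplified illustration, not the ANSI procedure: it assumes each band's audibility grows linearly with band SNR over a 30 dB range (from -15 dB to +15 dB), and it uses equal band weights unless others are supplied. The function names and the default weighting are our assumptions.

```python
def band_audibility(snr_db):
    """Fraction of the speech dynamic range that is audible in one band.

    Assumed linear map of band SNR over a 30 dB range, clipped to [0, 1].
    """
    return min(1.0, max(0.0, (snr_db + 15.0) / 30.0))


def articulation_index(band_snrs_db, weights=None):
    """Weighted sum of independent band contributions (0 = none, 1 = full)."""
    n = len(band_snrs_db)
    if weights is None:
        # equal band importance, in the spirit of equally contributing bands
        weights = [1.0 / n] * n
    return sum(w * band_audibility(s) for w, s in zip(weights, band_snrs_db))
```

Because the index depends only on per-band SNR, it characterizes the channel rather than the speech itself, which is why it is not used here to rate the enhancement algorithm.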

It would be appealing to use the physically measurable characteristics of the speech
signal in evaluating intelligibility or speech quality. However, such physical measures
should not be substituted for human listening tests [143]. When it is necessary to evaluate
the effects on speech quality, subjective listening tests are usually done to assess the
resulting quality of the processed speech. Evaluation methods such as the Articulation
Index (AI), Speech Intelligibility Index (SII), and Speech Transmission Index (STI) are
measures of speech intelligibility with regard to the channel characteristics of a speech
transmission system [126]. They describe the intelligibility degradation which can be
expected, and are based on experimentally determined articulation bands which contribute to
the overall intelligibility. These measures are a function of the signal-to-noise ratio in
the channels and require an estimate of the noise. In speech enhancement applications,
noise characteristics become signal dependent and violate the underlying assumptions of the
AI [21]. These measures describe the quality of the transmission channel in passing speech,
and the articulation index represents how robust the channel is to speech degradation from
the perspective of a listener. They do not provide an intelligibility value of the speech
itself, and are not suitable for evaluating the speech quality of a speech enhancement or
synthesis technique [149, 21]; they are communicability ratings. We cannot use these index
measures to determine the degree of degradation a speech processing algorithm will have on
speech. The best subjective measures of the effect of speech processing systems on
intelligibility are provided through Mean Opinion Score (MOS) listening tests [21, 23].
However, they require considerable time, effort, and cost. Thus, we will conduct general
subjective listening tests to evaluate the effects of the warped filter on speech quality.


6.2  Intelligibility  Test 

Intelligibility tests evaluate the number of words or speech sounds that are correctly
classified in a controlled situation. Intelligibility tests attempt to model the true
consequences of poor speech intelligibility such as misperceptions, confusions, or muddled
words. Such a test evaluates speech intelligibility by response scoring. Classification
responses provide an objective score as a percentage of correct responses. The Diagnostic
Rhyme Test (DRT) and Modified Rhyme Test (MRT) are tests of phoneme intelligibility. Single
syllables presented for identification are usually used in a carrier sentence or as
isolated utterances. The phoneme test can be open response, where the listener has to
provide what they heard, or closed response, where they are provided response answers as a
pair or list to choose from. The DRT is a two-alternative closed response test in which
initial consonants differ only by a distinctive feature. The DRT provides diagnostic
feature scores for six phonemic features: voicing, nasality, sibilation, sustention,
graveness, and compactness [42]. The MRT is a six-alternative closed response test with a
carrier sentence in which testing evaluates the initial and final word syllables, but does
not provide feature diagnostics. Harvard Sentence Tests and Phonetically Balanced (PB)
sentence tests evaluate subjective intelligibility of sentence words. Words in the context
of a sentence do not have the same acoustic or phonetic cues or contrasts as in isolated
presentations. They are more susceptible to co-articulation, which decreases
intelligibility, but are easier to discern within the grammatical context and semantic
structure.

An important aspect to consider when increasing the loudness of speech is the resulting
intelligibility of the modified speech. To address the question of intelligibility, we have
administered a variant of the Diagnostic Rhyme Test [142]. The Diagnostic Rhyme Test
(DRT) was originally designed to test the intelligibility of speech coders transmitted over
noisy channels. We employed the DRT method along with a simpler vocabulary used by
Junqua [72] in his study of speech intelligibility and the Lombard Effect. Table 6.1 lists
the vocabulary used for the intelligibility test.

Table 6.1: Vocabulary of words used for Rhyming Test of Intelligibility, subdivided into
confusable sets I-III.

Set   Words
I     f, s, yes, x
II    a, eight, h, k
III   b, c, d, e, g, p, t, v, z, three


6.2.1  Procedure 

A listener is placed in a quiet room in front of a PC running Matlab. In the Matlab
environment, the test is executed through a graphical user interface, which allows the user
to read the test instructions, enter personal information (name, email, native language),
and perform the actual rhyme test. The rhyme test GUI is shown in Figure 6.1.

[Figure 6.1 image: GUI showing SNR = 0 dB, Count = 1 of 60, two word choices (D, 3), and
Play and Next buttons.]

Figure 6.1: Intelligibility test GUI. A presentation of 60 random TI-46 utterances is
presented to the listener at 0 dB SNR, of which 50% are processed by the warped filter.

For each test question, one of the utterances from Table 6.1 was randomly selected along
with another word from the same confusable set. To obscure the utterance, white noise was
added to the utterance after algorithm application. The listener could play the utterance
as many times as desired but was forced to choose one of the two displayed words. The
test consisted of 60 utterances at 0 dB SNR. The listener heard the speech utterances
through Koss UR-20 padded headphones. The test took about 12 minutes for each of the
16 participants. The database is a subset of the TI-46 speech corpus, consisting of 8 male
and 8 female speakers. Each speaker utters each of the 20 words in Table 6.1 16 times.
It is a standard database used for recognition of isolated words, particularly the numerals
zero through nine and the American English alphabet.
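
The noise-masking step can be sketched as below. This is a minimal illustration and not the Matlab code used in the test: it assumes the SNR is defined over the whole utterance as the ratio of mean signal power to mean noise power, and it uses Gaussian white noise. The function name and the `seed` parameter are hypothetical.

```python
import math
import random


def add_white_noise(signal, snr_db, seed=0):
    """Return signal plus Gaussian white noise scaled to the requested SNR."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    p_signal = sum(s * s for s in signal) / len(signal)
    p_noise = sum(v * v for v in noise) / len(noise)
    # choose scale so that p_signal / (scale^2 * p_noise) = 10^(snr_db / 10)
    scale = math.sqrt(p_signal / (p_noise * 10.0 ** (snr_db / 10.0)))
    return [s + scale * v for s, v in zip(signal, noise)]
```

At 0 dB SNR the added noise carries the same average power as the utterance, which is what makes the closed-set rhyme task difficult enough to reveal differences.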


6.2.2  Intelligibility  Results 


Table  6.2  summarizes  the  results  of  the  rhyme  test  for  0 dB  SNR.  We  see  that  the 
average  change  in  intelligibility  for  all  16  listeners  is  around  -0.3%.  These  results  show  that 
the  warped  filter  does  not  degrade  speech  intelligibility.  Results  are  presented  in  percentage 
form  as, 


P_n = \frac{\text{Number Correct}}{\text{Total Number}} \times 100\%          (6.1)


The ± bars represent a 95% confidence interval of the mean across listeners for each
confusion set. Equation (6.2) with z = 1.96 sets the confidence interval to 95%.


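P\left( \bar{x} - z_{1-\delta/2}\,\frac{\sigma}{\sqrt{n}} < \mu < \bar{x} + z_{1-\delta/2}\,\frac{\sigma}{\sqrt{n}} \right) = 1 - \delta          (6.2)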
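
A minimal sketch of how Eqs. (6.1) and (6.2) might be evaluated from per-listener scores is given below. It assumes the sample standard deviation is used in place of the population value and that n is the number of listeners; both are our assumptions, and the function names are illustrative.

```python
import math


def percent_correct(num_correct, total):
    """Eq. (6.1): percentage of correct responses."""
    return 100.0 * num_correct / total


def confidence_interval(scores, z=1.96):
    """Mean and half-width of the interval mean +/- z * s / sqrt(n).

    scores: per-listener percent-correct values; z = 1.96 gives 95%.
    """
    n = len(scores)
    mean = sum(scores) / n
    # sample variance (n - 1 in the denominator), assumed estimator
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)
    return mean, z * math.sqrt(var / n)
```

A listener who answers 54 of 60 questions correctly scores 90%; the half-width returned for a group of listeners corresponds to the ± values reported in Table 6.2.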


Table 6.2: Average intelligibility results of the rhyme test for 16 listeners hearing 60
words at 0 dB SNR. Table results are displayed as the percent correct population mean
with ±95% Confidence Level

Set    Unprocessed    Processed     Difference
All    90.2 ± 3.5     89.9 ± 3.1    -0.3
I      91.7 ± 7.5     91.4 ± 5.0    -0.3
II     88.3 ± 6.9     90.9 ± 5.1    +2.6
III    90.8 ± 4.1     89.5 ± 4.4    -1.3

6.3  Loudness  Test 

Table  6.3  presents  the  vocabulary  words  used  in  the  isolated  word  loudness  listening 
tests.  Both  confusable  (C)  and  non-confusable  (NC)  words  are  used  in  the  test.  The  test 
is  structured  such  that  each  word  has  equal  probability  of  being  selected.  Word  set  V3  is 
not  included  in  the  loudness  test. 

Table 6.3: Vocabulary of words used for Loudness Test

Set   Type   Words
V1    NC     zero, one, two, four, five, six, seven, nine
V2    NC     enter, erase, help, repeat, right, rubout
V4    C      a, eight, h, k
V5    C      b, c, d, e, g, p, t, v, z, three
V6    C      m, n

The  loudness  listening  tests  are  similar  in  procedure  to  the  Intelligibility  tests.  The  main 
idea  behind  these  tests  is  to  have  a listener  select  which  of  two  spoken  words  is  louder. 
The  results  of  these  choices  will  reveal  by  how  many  dB  an  algorithm  can  perceptually 
increase  loudness,  on  average.  Figure  6.2  shows  the  loudness  listening  test  GUI  the  user 
interacts  with  to  make  his/her  selection.  The  objective  of  the  loudness  listening  tests  is 
to  estimate  the  dB  gain  improvement  the  loudness  enhancement  algorithm  provides.  Since 
the  algorithm  does  not  increase  speech  signal  power,  a scaled  comparison  test  is  required 
to  determine  the  perceptual  dB  gain.  There  are  no  known  listening  tests  which  explicitly 
quantify  speech  loudness  gain.  We  have  developed  a loudness  listening  test  which  we  feel 
generates  results  needed  to  approximate  the  loudness  level  increase. 

6.3.1  Procedure 

In section 3.2.2 and Table 5.3, bandwidth expansion was seen to elevate the sone level
of vowels in synthetic and TIMIT test sentences using the ISO-532B loudness analysis. The
technique also shows a noticeable loudness increase in informal subjective listening tests.
The subjective procedure we developed for the loudness tests is simply a way to determine
how many dB the processed speech can be attenuated before the algorithm no longer provides
a loudness increase. A total of 80 words are presented to each listener. 80% of these words
are allocated specifically to the formant expansion method of the warped filter. The
remaining percentage is distributed between an energy redistribution method and a screening
evaluation. A speech utterance is randomly selected from the Table 6.3 vocabulary of the
TI-46 database as the control word. The word is processed with the warped filter; we refer
to the processed word as the test word. During the listening test the test word is scaled
down and presented for comparison against the control word. The level at which it sounds as
loud as the control word represents the dB improvement.

The listener is not told which word was modified. The presentation order of the examples
is also random to avoid control tracking. For each presentation, the modified word
undergoes a single random scaling between 0 dB and 3 dB in 0.5 dB increments. The idea is
to randomize the scaling gradient so as not to let the listener know which loudness level
is being tested. The listener selects which of the two presented examples is louder. After
hitting the 'next' button, another pair is presented. The listener is required by software
to hear both words at least once and no more than three times each. The test only proceeds
after the listener has made a selection. A results file for each listener is generated
which contains their loudness choices categorized for each word, a list of the words
presented, and the scale factors corresponding to each word. These results determine, on
average, at what dB level decrease the modified speech sounds subjectively as loud as the
original. This level is in a region considered the algorithm's perceptual gain.
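
The scaled comparison can be sketched as follows. This is an illustrative outline, not the test software: it assumes the dB drop is applied as an amplitude gain of 10^(-dB/20), and the function names, the pair labels, and the list of candidate drops passed in are hypothetical.

```python
import random


def attenuate(samples, drop_db):
    """Scale a waveform down by drop_db decibels (amplitude gain 10^(-dB/20))."""
    gain = 10.0 ** (-drop_db / 20.0)
    return [gain * s for s in samples]


def make_trial(control, processed, drops_db, rng):
    """Build one comparison: a random dB drop and a randomly ordered pair."""
    drop = rng.choice(drops_db)
    pair = [("control", control), ("test", attenuate(processed, drop))]
    rng.shuffle(pair)  # random presentation order hides which word was modified
    return drop, pair
```

Tallying, for each dB drop, how often the test word is chosen as louder gives the curve from which the equal-loudness point is read.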

[Figure 6.2 image: GUI showing a trial counter, Play 1 and Play 2 buttons, and the choices
"1 is Louder" and "2 is Louder".]

Figure 6.2: Loudness test GUI. A total of 80 words are presented to each listener of which
85% of the words are processed by the warped filter. A random scaling gradient between 0
to 5 dB in increments of 1 dB is applied to the speech tokens to determine the perceptual
gain.


6.3.2  Sensitivity  Screening 

In developing the loudness test we wanted to assure ourselves of good data collection,
in the sense that good data analysis could be performed, and that our listeners had good
hearing. Therefore, we decided to include a sub-test within the loudness experiments which
would, at a general level, reveal the auditory resolution of the listener. We wanted to
determine how well our listeners were able to discriminate loudness differences without
letting them know we were testing their sensitivity, and to see whether they were paying
attention, since these tests can be repetitive and fatiguing. This information would also
serve a supportive role in tying our algorithm loudness results to an auditory base
reference. For example, it would be pleasing to have results stating an algorithm
improvement of 2 dB, but very displeasing to realize that the listeners' minimum loudness
resolution was 3 dB. We want to ensure our listeners can discriminate loudness clearly
enough to evaluate our algorithms. During the test, we would present two words from the
dataset, but one would not be algorithmically modified; it would simply be a dB-scaled
version of the other word. In this manner, we could determine their minimum sensitivity to
loudness on average upon completion of the test.

The motivation of the subject sensitivity screening is to provide a reference for how
acute the listener is with regard to auditory sensitivity. Figure 6.3 shows the sensitivity
screening results of the loudness listening test for all 16 listeners as the dotted lines.
The vertical bars represent a 95% confidence interval of the mean across listeners. It
should be expected that at 0 dB there is complete guessing, since the words are identical,
and so we see a 50% choice response. We would expect a 50% average up to about 1 dB, which
is the average base level for loudness discrimination in the human auditory system for pure
tones [149]. Our listeners seemed to do better, perhaps due to the quiet listening
conditions and headphones. For the most part, it seems on average that the listeners are
able to attain good separation below a 1 dB difference. It is also known that on average
human auditory sensitivity is poor below 1 dB and this level is the general minimum at
which a loudness increase is noticeable [149]. However, this is the general figure for pure
tones and not speech. There will be other factors contributing to the minimum perception of
loudness discrimination when speech is used as the reference.

6.3.3  Loudness  Results 

The following section presents results for the combined statistics of linear and warped
bandwidth expansion. A preliminary test was conducted to evaluate the overall testing
procedure, aesthetics, and algorithm performance expectations, i.e., listener comfort,
complaints, suggestions, problems, bugs, presentation level, and environment. Good results
are contingent on properly developed experiments as well as sufficient data. There is a
compromise as to how much data can be collected from a single listener before sacrificing
their attention, concentration, and available time. We want to preserve the integrity and
interest of the experiment and make the listener experience comfortable so as to avoid
quick and unconcentrated decisions. The initial tests also served to determine the
available range in terms of algorithm performance and expectations. We determined that a
45 minute listening test is just short enough to avoid being tiresome or frustrating to the
listener, and long enough for well bounded results.


Figure  6.3:  Solid  line  shows  the  effective  dB  gain  of  the  warped  filter  lies  just  above  the 
2dB  crossover  point.  Graph  provides  a scaled  comparison  of  the  average  loudness  ratings 
for  the  TI-46  words  processed  by  the  warped  filter  presented  to  16  listeners.  The  dotted 
line  corresponds  to  the  sensitivity  screening,  which  shows  the  listeners’  hearing  resolution 
is  well  separated  at  2dB.  Bars  are  the  95%  confidence  intervals  of  Eq(6.2). 


Figure  6.3  presents  results  of  the  loudness  listening  test  for  the  warped  filter  of  Eq(4.38) 
with  a = 0.5.  The  loudness  listening  test  presented  80  words  to  each  of  16  listeners.  85% 
of  the  words  were  randomly  processed  with  the  warped  filter,  and  the  remaining  15%  were 
selected  for  sensitivity  screening.  The  graph  shows  the  available  dB  gain  improvement  of 
the  algorithm  by  determining  at  what  point  equal  loudness  is  perceived.  At  0 dB  difference 
(equal  power)  the  enhanced  word  was  selected  as  being  louder  90%  of  the  time.  The  cross 
over  point  shows  the  dB  gain  at  which  the  method  of  formant  expansion  appears  subjectively 
as  loud  as  the  unprocessed  original  data.  The  cross  over  occurs  just  above  2dB.  Below  this 
range,  there  is  clear  separation  in  listener  choices.  Since  the  crossover  is  the  50%  guess 
point,  we  are  relatively  confident  in  the  loudness  gain  achieved  by  the  algorithm.  The 
brackets  represent  the  95%  confidence  intervals  as  presented  in  Eq(6.2). 

The crossover point represents the dB level drop at which the modified words sound
as loud as the unmodified words on average. It should be clear that at some point a
sufficiently large dB drop in the modified word would weigh the choice towards the
unmodified word, regardless of how good the algorithm is. At this point an equal separation
of percentages would be observed on the far left vertical axis and the far right vertical
axis. We expected a 5 dB drop to provide a sufficient range to balance the plot, and to be
reasonable enough to feel confident there were no underlying experimental problems. It is
encouraging to see an intersection that opens outward to the left with a slope equal to
that with which it opens to the right. Table 6.4 presents the distribution results of
Figure 6.3 for warped bandwidth expansion (a = 0.5).

Table  6.4:  Loudness  listening  test  for  warped  filter  with  a = 0.5:  Total  number  of  times 
the  processed  word  was  selected  over  the  original  word  for  all  16  listeners 


dB drop    Original    Processed
   0          13          168
   1          39          142
   2          82           94
   3         124           71
   4         156           30
   5         168           11
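
As a rough check of the crossover reading, the counts in Table 6.4 can be converted to the fraction of trials in which the processed word was chosen, and the 50% point located between adjacent dB drops. The sketch below assumes straight-line interpolation between the tabulated points; that interpolation, and the function names, are illustrative choices and not the procedure used to draw Figure 6.3.

```python
# Counts from Table 6.4: (dB drop, times original chosen, times processed chosen).
TABLE_6_4 = [
    (0, 13, 168),
    (1, 39, 142),
    (2, 82, 94),
    (3, 124, 71),
    (4, 156, 30),
    (5, 168, 11),
]


def processed_fraction(original, processed):
    """Fraction of trials in which the processed word was judged louder."""
    return processed / (original + processed)


def crossover_db(rows, level=0.5):
    """Linearly interpolate the dB drop where the processed fraction equals `level`.

    Assumption: the fraction falls monotonically with dB drop, so the first
    bracketing pair of rows is the one of interest.
    """
    points = [(db, processed_fraction(o, p)) for db, o, p in rows]
    for (db0, f0), (db1, f1) in zip(points, points[1:]):
        if f0 > f1 and f0 >= level >= f1:
            return db0 + (f0 - level) * (db1 - db0) / (f0 - f1)
    raise ValueError("fraction never crosses the requested level")


crossover = crossover_db(TABLE_6_4)
print(f"estimated crossover: {crossover:.2f} dB")
```

Under this assumption the 50% point falls between the 2 dB and 3 dB rows, which agrees with the statement that the crossover lies just above 2 dB.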



6.4  Acceptability  Test 

Acceptability, or quality, tests evaluate the acceptability of the system based on listener
judgements of subjective voice quality [142]. These tests assess acceptability in addition to
intelligibility, since degradations can affect the aesthetic quality of speech. Good
intelligibility is usually well correlated with subjective quality, though not always. Some
processing, such as noise suppression, can improve quality scores but decrease intelligibility
scores [21]. These tests are usually in the form of paired comparisons or rating scales. Paired
comparison tests solicit the listeners' preference among various speech processing systems and
require a response stating which is more acceptable, such as vocoder Mean Opinion Score
(MOS) listening tests, or the Diagnostic Acceptability Measure (DAM). Paired comparison
tests can also be performed against a control reference. Such tests may evaluate the quality
of one system at different levels of additive noise. Rating scale tests require the listener to
rate the subjective quality by assigning a value [22]. The value can be on a numeric scale
0-5 or on a categorical scale: excellent, good, fair, poor, and bad [24]. Rating tests are
simple and practical, and experimental evidence indicates that the rank orderings assigned
by the rating scale are well correlated with paired comparison tests [142].

6.4.1  Procedure 

A total of 20 phonetically balanced training sentences from the TIMIT database are
presented to the same 16 listeners. The sentences are selected at random from the database.
The listening test GUI for the acceptability test is shown in Figure 6.4. Each sentence and
a warped-filter version of the sentence are presented to the listener for loudness comparison
and quality assessment. The first procedure is similar to that of the loudness test. The
listener is required to select which of the two sentences is louder. Both sentences are at
equal power on a frame-by-frame basis, i.e. there is no gain scaling, and the overall power
level is that of the original sentence. All 20 sentences are first evaluated for loudness. The
second procedure evaluates the acceptability level, or quality rating, of both sentences. The
listener is required to rate the quality of both sentences as excellent, good, or fair.




Figure 6.4: Acceptability test GUI. Twenty sentence pairs are presented to the
listener to rate quality and overall loudness. One sentence in each pair is processed by the
warped filter, and the other is the original.

6.4.2  Acceptability  Results 

The  acceptability  test  was  included  to  evaluate  the  overall  quality  of  speech  processed 
by  the  warped  filter.  The  intelligibility  tests  are  based  on  single  word  utterances  in  noise, 
and  the  loudness  tests  are  also  word  comparison  tests.  Because  the  acceptability  test  is  on 
complete  sentences,  it  demonstrates  how  the  warped  filter  affects  continuous  speech.  It  is  a 
preview  to  how  the  warped  filter  truly  affects  the  subjective  quality  of  speech.  The  sentence 
is  processed  on  a real  time  basis  and  reflects  how  the  filter  would  actually  process  speech. 
This  test  demonstrates  the  implementation  of  the  warped  filter  as  a speech  enhancement 
method. 

Table 6.5 presents the results of the acceptability test for all 16 listeners. It provides
each listener's average quality rating of the 20 original (A) sentences and processed (B)
sentences. The processed sentences are the original sentences passed through the warped
filter with a = 0.5. Results in Table 6.5 indicate that listeners perceived both the original
and processed sentences to be at the same relative quality. The quality ratings were
excellent = 1, good = 2, and fair = 3. The listeners' total average response was A=1.56 for
the original sentences and B=1.47 for the processed sentences. These results indicate that
the overall warped filter process does not disturb the original quality of the speech. Table 6.5
also provides the comparative loudness sentence tests. Results demonstrate that the overall
perception of the processed speech was louder by an 18:2 ratio, implying the processed
sentences sound louder, on average, 90% of the time.

Table 6.5: Sentence acceptability results for original sentences (A) and processed sentences
(B) with warped filter for a = 0.5. Twenty random sentences from the TIMIT dataset
were presented to each of 16 listeners. The Quality rating, 1 (excellent) to 3 (fair), is the
listener's mean response for the 20 sentences, and the #L columns give the number of times
each version was selected as being louder. The count for the processed sentences is also
given as a percentage in the last column.


#      sex   age   Quality A   Quality B   #L A    #L B     %
1      M     30    1.60        1.10        0       20       100
2      M     22    1.30        1.15        1       19       95
3      F     24    1.25        1.35        0       20       100
4      M     25    1.70        1.50        0       20       100
5      M     32    2.50        1.90        2       18       90
6      M     23    1.45        2.35        5       15       75
7      M     23    2.10        1.05        0       20       100
8      M     23    2.15        2.10        7       13       65
9      M     23    1.15        1.00        2       18       90
10     M     28    1.50        1.40        5       15       75
11     M     22    1.10        1.70        1       19       95
12     M     31    1.85        1.25        0       20       100
13     M     23    1.50        1.50        0       20       100
14     M     29    1.55        1.50        0       20       100
15     M     22    1.70        1.50        0       20       100
16     M     25    2.10        1.50        0       20       100
mean               1.65        1.52        1.25    18.75    94

CHAPTER  7 
CONCLUSIONS 


The goal of this research has been to develop a speech enhancement technique which
increases the perception of loudness without increasing signal energy. Such a motivation
requires biological inspiration and ingenuity in speech signal processing. The first chapter of
this research has been to understand the role of loudness in the human sensation of hearing,
since such a technique, by its nature, must exploit the psychoacoustic nature of the auditory
system. We have reviewed in detail the physiological mechanics and the psychoacoustic
nature of the peripheral auditory system. We have studied the role of auditory filters,
excitation functions, suppression and masking, and the critical band concept, and revealed why
loudness increases as a critical band is exceeded. We have presented in detail both Zwicker's
and Moore & Glasberg's models of loudness. We have dissected the ISO-532B loudness
analysis and described why loudness is a function of the three main principles of psychophysics:
critical band analysis, the power law of hearing, and the outer to middle ear transfer
characteristics. We proposed an approximation using the PLP method which conforms to this model
with an average error of less than 1 sone for normal speech levels in all phoneme categories.
We can attribute the error to a variation of the exponent in the compressive nonlinearity for
tonal and nontonal signals in the loudness model. In Zwicker's model the exponent varies
from 0.3 to 0.23 for tonal and non-tonal components. A slight change in the exponent has
a noticeable effect on the final loudness. As an analogy, hearing impaired patients exhibit a
reduction of dynamic range in loudness recruitment due to a complication of the nonlinear
compression, and our loudness approximation suffers the same disability.




In chapter 3, we provided phoneme analysis of the TIMIT database, and showed that of
all the phoneme categories, vowels are the most suitable candidates for bandwidth
expansion. Results in Table 3.1 showed that vowels have 82% of the total speech energy, smooth
spectral envelopes, and are of the longest duration in comparison to the stops, fricatives,
affricates, nasals, and glides. We also showed that vowels have relatively little masking (4%)
and accessory loudness (6%) contributions. We also provided studies which revealed that
vowel formant bandwidths can be excessively widened without degrading their
identification. We also provided studies in section 3.2.1 which reveal that loudness adaptation is a
principal component of auditory fatigue. For sounds in which adaptation is evident,
broadening the spectral distribution will decrease adaptation and may suppress the elevation
of the loudness threshold.

The first realization of a loudness enhancement filter materialized from the models
of loudness presented in chapter 2. For an equal energy bandwidth product, loudness
increases when a critical band is exceeded. This allowed us to consider filtering techniques
which perform bandwidth expansion as candidates for a loudness enhancement filter. The
pole displacement model was presented as a speech recognition modelling technique to
adjust formant bandwidths. It provided a better match of noise-free speech templates to
noisy reference templates. In chapter 4 we show how this technique is used in a speech
enhancement filter to broaden the bandwidths of formant poles. The filter is also known as
both a vocoder post-filter and a perceptual noise weighting filter for low bit rate coders.
However, it performs linear bandwidth expansion, and is used only as a speech enhancement
filter to suppress quantization noise and to restore original formant bandwidth. In order to
expand the poles on a non-linear frequency scale, we introduced a warped filter design. Warped
filters have been used only in audio modelling and in prototype perceptual vocoder systems.
This thesis is the first study known to use the pole displacement model in a warped filter
design and to use warped filters for speech enhancement. This design effectively allows for
bandwidth expansion of the vowel formants on a critical band scale. As a vocoder perceptual
noise weighting filter, it can also be used to generate an error weighting distribution on the
critical band frequency scale.




The  implementation  behind  the  warped  filter  and  the  motivation  of  the  Gamma  model 
were  introduced  as  a way  to  approach  filtering  from  a signal  representation  perspective.  By 
interpreting  the  filter  functionality  as  a transformation  of  the  signal  space,  we  no  longer 
need  to  concern  ourselves  about  filter  structures.  In  essence  we  can  consider  changing 
the  representation  of  the  signal  space,  where  the  projection  is  of  the  signal  on  a basis 
function.  The  Laguerre  and  Gamma  filters  were  shown  to  support  this  novel  relation  and 
allow  for  a filter  translation  which  implements  a unique  representation  of  the  frequency 
scale.  In  addition,  the  gamma  filter  is  a basis  of  real  (non-complex)  exponentials  and  is 
more  plausible  as  a functional  operator  in  biological  processes.  Studies  with  the  gamma 
filter  should  be  further  researched  to  clarify  the  biological  inspiration  and  mathematical 
relations  of  the  warped  design  approach. 

In chapter 5, objective measures were presented to determine the loudness increase
provided by the warped filter. The total average increase in vowels was seen to be N_y/N_x =
1.38 times louder with the warped filter in comparison to unprocessed speech for a = 0.5
with fs = 16 kHz in Table 5.3. Results also confirmed that the optimal warping factor for
increasing vowel loudness corresponds to critical band expansion. A gain function which
relates energy to effective loudness was presented to better describe the concept of a loudness
increase in terms of an effective dB increase. The gain function determines the scaling we
can apply to elevate the original signal for equal loudness to the processed signal. In Table
5.8 the gain function shows an effective vowel gain of 2.1 dB for N_y/N_x = 1.38 with a
loudness error |1 - N_y/N_gx| = 0.1 using the true ISO-532B for the warped filter with
a = 0.5. Speech recognition results were also presented to objectively evaluate the change
in recognition rates for speech processed by the warped filter. DTW and continuous HMM
results respectively demonstrate a recognition decrease of 1% and 2% with the warped filter
in Tables 5.10 and 5.12.

Listening tests for the warped filter design were presented to evaluate the improvement
in loudness, the effect on intelligibility, and the subjective quality of the processed speech.
Intelligibility results in Table 6.2 show only a 0.3% reduction in intelligibility scores on
average for the diagnostic rhyme test. Loudness tests were conducted to correlate the 2.1 dB
analytic gain of the vowel regions from Table 5.8 to subjective loudness gains. Loudness
listening test results in Figure 6.3 showed an effective dB gain of 2.0 dB with the warped
filter design of section 4.4 using critical band expansion. This result is close to the 2.1 dB
gain we predicted with the analytic gain function. Acceptability test results from Table
6.5 compare the quality of speech processed by the warped filter with that of the original
speech. Results indicate a slightly better quality rating of 1.52 for the warped filter with
a = 0.5 than the 1.65 quality rating of the original speech (on this scale 1 is excellent and
3 is fair, so a lower value is better). Results also state that of the 20 sentences presented to
each listener, 94% of those selected as louder were the processed sentences.

These  results  indicate  the  successful  accomplishment  of  the  stated  objective;  to  design 
a speech  enhancement  filter  which  enhances  perceptual  loudness  without  increasing  signal 
energy.  A complete  review  of  loudness  has  been  provided,  from  the  physiological  perspective 
to  the  psychoacoustic  models.  A thorough  review  of  the  peripheral  auditory  system  and  the 
representation  of  sound  has  been  studied  and  presented.  The  importance  of  acoustic  and 
phonetic  cues  for  speech  discrimination,  intelligibility,  and  quality  in  speech  enhancement 
has  been  investigated.  The  intimate  relations  between  bandwidth,  non-linear  compression, 
the  critical  band,  and  hearing  sensitivity  in  regards  to  loudness  have  been  revealed.  A 
warped  filter  has  been  proposed  to  improve  the  perceptual  loudness  of  speech  without 
increasing  signal  energy  or  degrading  intelligibility.  The  design  has  been  inspired  by  the 
biological  representation  of  loudness  in  the  peripheral  auditory  system.  The  inclusion  of  a 
warped  pole  displacement  model  for  non-linear  bandwidth  expansion  was  motivated  from 
the  critical  band  concept  of  hearing.  Objective  and  subjective  listening  test  results  have 
indicated  a correlated  perceptual  loudness  gain  of  approximately  2dB.  Speech  recognition 
results  indicated  a slight  loss  in  machine  recognition  performance,  but  acceptability  listening 
tests  report  a better  overall  subjective  quality  rating  of  the  loudness  enhancement  method. 

For future research, further evaluation of the warped filter in noise conditions will be
required to better assess the speech enhancement performance. Noise will have a pronounced
effect on loudness as well as intelligibility and will most likely compromise performance. We
have revealed an intimate balance between speech loudness and intelligibility, but have not
completely resolved the relation. Also, the use of the warped filter as a perceptual noise
weighting filter is extremely encouraging. Presently, all vocoder perceptual weighting filters
distribute noise weightings on a perceptual amplitude scale and a linear frequency scale. As
we know, the critical band frequency scale is closer to that of human hearing and
sensitivity. The warped filter can be successfully used to distribute noise error weightings
not only on a perceptual amplitude scale, but on a perceptual frequency scale as well.




APPENDIX  A 

FILTER  COEFFICIENT  TRANSFORMATION 


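In this appendix we derive the binomial equations for the linear transformation and
demonstrate the inclusion of an off-axis radius term for bandwidth expansion. The all-pass
element can be decomposed into a polynomial in z^{-1}/(1 - \alpha z^{-1}):

\[
\begin{aligned}
\tilde{z}^{-1} &= \frac{z^{-1} - \alpha}{1 - \alpha z^{-1}} \\
&= \frac{z^{-1} - \alpha^2 z^{-1} - \alpha + \alpha^2 z^{-1}}{1 - \alpha z^{-1}} \\
&= (1 - \alpha^2)\,\frac{z^{-1}}{1 - \alpha z^{-1}} - \alpha\,\frac{1 - \alpha z^{-1}}{1 - \alpha z^{-1}} \\
&= (1 - \alpha^2)\,\frac{z^{-1}}{1 - \alpha z^{-1}} - \alpha
\end{aligned}
\]

The lag-free term must be accounted for, and all terms delayed equally. Substituting
\beta = (1 - \alpha^2) and y = z^{-1}/(1 - \alpha z^{-1}) we get

\[
\tilde{z}^{-1} = \beta y - \alpha
\]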


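Including this in the all-pole model and extending terms for p = 4 for illustration we get

\[
A(z) = \sum_{k=0}^{p} a_k \tilde{z}^{-k} = \sum_{k=0}^{p} a_k [\beta y - \alpha]^k, \qquad a_0 = 1
\]

Expanding the summation from k = 0...3 we get the following elements of summation

\[
\begin{aligned}
k = 0 &\rightarrow a_0[\beta y - \alpha]^0 \\
k = 1 &\rightarrow a_1[\beta y - \alpha]^1 \\
k = 2 &\rightarrow a_2[\beta y - \alpha]^2 = a_2[\beta^2 y^2 - 2\beta y\alpha + \alpha^2] \\
k = 3 &\rightarrow a_3[\beta y - \alpha]^3 = a_3[\beta^3 y^3 - 3\beta^2 y^2\alpha + 3\beta y\alpha^2 - \alpha^3]
\end{aligned}
\]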




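In matrix form the columns represent the coefficients of a polynomial in y. The column
summations correspond to the b_k coefficient terms. The linear transformation is
equivalent to multiplication with the fixed triangular matrix. The triangular matrix allows the
recursive algorithm presented in the following section.

\[
\begin{bmatrix} a_0 & a_1 & a_2 & a_3 \end{bmatrix}
\begin{bmatrix}
0y^3 & 0y^2 & 0y & +\alpha^0 \\
0y^3 & 0y^2 & +\beta y & -\alpha \\
0y^3 & +\beta^2 y^2 & -2\beta\alpha y & +\alpha^2 \\
+\beta^3 y^3 & -3\beta^2\alpha y^2 & +3\beta\alpha^2 y & -\alpha^3
\end{bmatrix}
=
\begin{bmatrix} b_3 y^3 & b_2 y^2 & b_1 y^1 & b_0 y^0 \end{bmatrix}
\]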

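The binomial equations solve the triangular matrix set of equations. The b_k coefficients are
generated by a linear transform of the prediction coefficients a_k,

\[
b_k = \sum_{n=k}^{p} C_{kn}\, a_n \qquad (1)
\]

For validity we complete the calculations k = 0...3. We see each of the b_k terms correspond
to the column summations over a_n of the binomial matrix.

\[
\begin{aligned}
b_0 = \sum_{n=0}^{3} C_{0n} a_n: \quad
& C_{00} = (1)\beta^0(-\alpha)^0 = +1 \\
& C_{01} = (1)\beta^0(-\alpha)^1 = -\alpha \\
& C_{02} = (1)\beta^0(-\alpha)^2 = +\alpha^2 \\
& C_{03} = (1)\beta^0(-\alpha)^3 = -\alpha^3 \\
b_1 = \sum_{n=1}^{3} C_{1n} a_n: \quad
& C_{11} = (1)\beta^1(-\alpha)^0 = +\beta \\
& C_{12} = (2)\beta^1(-\alpha)^1 = -2\beta\alpha \\
& C_{13} = (3)\beta^1(-\alpha)^2 = 3\beta\alpha^2 \\
b_2 = \sum_{n=2}^{3} C_{2n} a_n: \quad
& C_{22} = (1)\beta^2(-\alpha)^0 = \beta^2 \\
& C_{23} = (3)\beta^2(-\alpha)^1 = -3\beta^2\alpha \\
b_3 = \sum_{n=3}^{3} C_{3n} a_n: \quad
& C_{33} = (1)\beta^3(-\alpha)^0 = \beta^3
\end{aligned}
\]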
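
The tabulated C_kn entries can be checked numerically. The sketch below assumes the general entry is the binomial coefficient of (n, k) times beta^k times (-alpha)^(n-k), which is the pattern of the entries listed above; the function names and the test coefficients are illustrative only. It builds the triangular matrix, forms b_k, and confirms that the polynomial in y with coefficients b_k equals the direct sum of a_n (beta*y - alpha)^n.

```python
from math import comb


def binomial_matrix(p, alpha):
    """C[k][n] = comb(n, k) * beta**k * (-alpha)**(n - k) for n >= k, else 0."""
    beta = 1.0 - alpha ** 2
    return [
        [comb(n, k) * beta ** k * (-alpha) ** (n - k) if n >= k else 0.0
         for n in range(p + 1)]
        for k in range(p + 1)
    ]


def transform(a, alpha):
    """Map prediction coefficients a[0..p] to b[0..p] via b_k = sum_{n>=k} C_kn a_n."""
    p = len(a) - 1
    C = binomial_matrix(p, alpha)
    return [sum(C[k][n] * a[n] for n in range(k, p + 1)) for k in range(p + 1)]


def evaluate_direct(a, alpha, y):
    """Evaluate sum_n a_n (beta*y - alpha)**n without expanding the binomials."""
    beta = 1.0 - alpha ** 2
    return sum(a_n * (beta * y - alpha) ** n for n, a_n in enumerate(a))


def evaluate_poly(b, y):
    """Evaluate sum_k b_k y**k."""
    return sum(b_k * y ** k for k, b_k in enumerate(b))


a = [1.0, -0.9, 0.4, -0.1]   # hypothetical predictor coefficients, a_0 = 1
alpha = 0.5
b = transform(a, alpha)
print(b)
```

Because the matrix depends only on alpha and the order, it can be computed once and reused for every frame of prediction coefficients.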

In order to include a bandwidth expansion term, it is necessary to know where the term
should be included. The all-pole model (if we take the inverse of both sides) is given by

\[
A(z)\big|_{z = e^{jw}} = \sum_{k=0}^{p} a_k\, e^{-jwk} \qquad (2)
\]

A bandwidth broadening technique effectively evaluates the z transform on a circle greater
than the unit circle. The new evaluation circle can be expressed as a function of the radius
r,

\[
A(z)\big|_{z = re^{jw}} = \sum_{k=0}^{p} a_k\, r^{-k} e^{-jwk} \qquad (3)
\]



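It can also be interpreted as the z transform of a power series scaling of the a_k coefficients.

\[
\begin{aligned}
A(re^{jw}) &= \sum_{k=0}^{p} (a_k r^{-k})\, e^{-jwk} \\
A(rz) &= a_0 + a_1 r^{-1} z^{-1} + a_2 r^{-2} z^{-2} + a_3 r^{-3} z^{-3} + \dots \\
&= a_0 + a_1 \left(\frac{z}{\gamma}\right)^{-1} + a_2 \left(\frac{z}{\gamma}\right)^{-2} + a_3 \left(\frac{z}{\gamma}\right)^{-3} + \dots \\
&= A(z/\gamma), \qquad \gamma = r^{-1}
\end{aligned}
\]

Since the prediction coefficients are scaled by a power series of r it is necessary to include
this in the binomial evaluation matrix

\[
\begin{bmatrix} a_0 r^{-0} & a_1 r^{-1} & a_2 r^{-2} & a_3 r^{-3} \end{bmatrix}
\begin{bmatrix}
0y^3 & 0y^2 & 0y & +\alpha^0 \\
0y^3 & 0y^2 & +\beta y & -\alpha \\
0y^3 & +\beta^2 y^2 & -2\beta\alpha y & +\alpha^2 \\
+\beta^3 y^3 & -3\beta^2\alpha y^2 & +3\beta\alpha^2 y & -\alpha^3
\end{bmatrix}
=
\begin{bmatrix} b_3 y^3 & b_2 y^2 & b_1 y^1 & b_0 y^0 \end{bmatrix}
\]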

This multiplication translates to a power series scaling of like polynomial terms in y.
Thus, each column is scaled by a power series of r. This translates to the inclusion of the
power series in the C_kn coefficients of the binomial equation, as shown below.

\[
b_k = \sum_{n=k}^{p} C_{kn}\, a_n, \qquad
C_{kn} = \binom{n}{k} (1 - \alpha^2)^{k} (-\alpha)^{n-k}\, r^{-n} \qquad (4)
\]


In this section we derive the recursion for the binomial equation and the inclusion of the
radius term for warped bandwidth expansion. The triangular matrix can be represented by
a recursion and is given as

    b_p = a_p
    for n = 1...p - 1 {
        b_{p-n} = a_{p-n} - \alpha b_{p-n+1}
        if n > 1 {
            for k = p - n + 1...p - 1 {
                b_k = (1 - \alpha^2) b_k - \alpha b_{k+1} } }
        b_p = (1 - \alpha^2) b_p }




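where a represents the predictor coefficients. Setting up the binomial terms after each
iteration n we get

\[
\begin{array}{lll}
n = 1 & b_p = a_p & \\
      & b_{p-1} = a_{p-1} - \alpha b_p & \\
n = 2 & b_{p-2} = a_{p-2} - \alpha b_{p-1} & \\
      & b_{p-1} = \beta b_{p-1} - \alpha b_p & \\
      & b_p = \beta b_p & \\
n = 3 & b_{p-3} = a_{p-3} - \alpha b_{p-2} & \rightarrow b_0 \\
      & b_{p-2} = \beta b_{p-2} - \alpha b_{p-1} & \rightarrow b_1 \\
      & b_{p-1} = \beta b_{p-1} - \alpha b_p & \rightarrow b_2 \\
      & b_p = \beta b_p & \rightarrow b_3
\end{array}
\]

Solving for the final b_k values with the recursive substitutions, where a superscript (n)
denotes the value held after iteration n,

\[
\begin{aligned}
b_3 &= \beta b_p^{(2)} = \beta\beta b_p^{(1)} = \beta\beta\beta a_3 = \beta^3 a_3 \\[4pt]
b_2 &= \beta b_{p-1}^{(2)} - \alpha b_p^{(2)} \\
    &= \beta[\beta(a_{p-1} - \alpha b_p) - \alpha\beta b_p] - \alpha\beta^2 b_p \\
    &= \beta[\beta a_2 - \alpha\beta a_3 - \alpha\beta a_3] - \alpha\beta^2 a_3 \\
    &= \beta^2 a_2 - \alpha\beta^2 a_3 - \alpha\beta^2 a_3 - \alpha\beta^2 a_3 \\
    &= \beta^2 a_2 - 3\beta^2\alpha a_3 \\[4pt]
b_1 &= \beta b_{p-2}^{(2)} - \alpha b_{p-1}^{(2)} \\
    &= \beta[a_{p-2} - \alpha b_{p-1}^{(1)}] - \alpha[\beta b_{p-1}^{(1)} - \alpha b_p^{(1)}] \\
    &= \beta[a_1 - \alpha(a_2 - \alpha a_3)] - \alpha[\beta(a_2 - \alpha a_3) - \alpha\beta a_3] \\
    &= \beta a_1 - \alpha\beta a_2 + \alpha^2\beta a_3 - \alpha\beta a_2 + \alpha^2\beta a_3 + \alpha^2\beta a_3 \\
    &= \beta a_1 - 2\alpha\beta a_2 + 3\alpha^2\beta a_3 \\[4pt]
b_0 &= a_{p-3} - \alpha b_{p-2}^{(2)} \\
    &= a_0 - \alpha[a_1 - \alpha b_{p-1}^{(1)}] \\
    &= a_0 - \alpha[a_1 - \alpha[a_2 - \alpha a_3]] \\
    &= a_0 - \alpha a_1 + \alpha^2 a_2 - \alpha^3 a_3
\end{aligned}
\]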

we see the final b_k coefficients match the columns in the binomial matrix. Previous results
show that the matrix columns must be scaled by a power series of r. By carefully tracing
the origin of the recursions, we can place the r term in a position that allows
this to take place. For this to happen, r must be placed at the first occurrence of the \alpha and
\beta values in the recursion. We provide the same analysis as above, this time with the
inclusion of the radius term. The recursion correctly propagates r, scaling each a_n by a
power of r given by the predictor coefficient index n.




    b_p = a_p
    for n = 1...p - 1 {
        b_{p-n} = a_{p-n} - r^{-1} \alpha b_{p-n+1}
        if n > 1 {
            for k = p - n + 1...p - 1 {
                b_k = r^{-1}(1 - \alpha^2) b_k - r^{-1} \alpha b_{k+1} } }
        b_p = r^{-1}(1 - \alpha^2) b_p }


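following through,

\[
\begin{array}{lll}
n = 1 & b_p = a_p & \\
      & b_{p-1} = a_{p-1} - r^{-1}\alpha b_p & \\
n = 2 & b_{p-2} = a_{p-2} - r^{-1}\alpha b_{p-1} & \\
      & b_{p-1} = r^{-1}\beta b_{p-1} - r^{-1}\alpha b_p & \\
      & b_p = r^{-1}\beta b_p & \\
n = 3 & b_{p-3} = a_{p-3} - r^{-1}\alpha b_{p-2} & \rightarrow b_0 \\
      & b_{p-2} = r^{-1}\beta b_{p-2} - r^{-1}\alpha b_{p-1} & \rightarrow b_1 \\
      & b_{p-1} = r^{-1}\beta b_{p-1} - r^{-1}\alpha b_p & \rightarrow b_2 \\
      & b_p = r^{-1}\beta b_p & \rightarrow b_3
\end{array}
\]

Solving for the final b_k values with the recursive substitutions, where a superscript (n)
denotes the value held after iteration n,

\[
\begin{aligned}
b_3 &= r^{-1}\beta b_p^{(2)} = r^{-1}\beta\, r^{-1}\beta\, b_p^{(1)} = r^{-1}\beta\, r^{-1}\beta\, r^{-1}\beta\, a_3 = r^{-3}\beta^3 a_3 \\[4pt]
b_2 &= r^{-1}\beta b_{p-1}^{(2)} - r^{-1}\alpha b_p^{(2)} \\
    &= r^{-1}\beta[r^{-1}\beta b_{p-1}^{(1)} - r^{-1}\alpha b_p^{(1)}] - r^{-1}\alpha\, r^{-1}\beta\, b_p^{(1)} \\
    &= r^{-1}\beta[r^{-1}\beta(a_{p-1} - r^{-1}\alpha b_p) - r^{-1}\alpha\, r^{-1}\beta\, b_p] - r^{-1}\alpha\, r^{-2}\beta^2 b_p \\
    &= r^{-1}\beta[r^{-1}\beta a_2 - r^{-2}\alpha\beta a_3 - r^{-2}\alpha\beta a_3] - r^{-3}\alpha\beta^2 a_3 \\
    &= r^{-2}\beta^2 a_2 - r^{-3}\alpha\beta^2 a_3 - r^{-3}\alpha\beta^2 a_3 - r^{-3}\alpha\beta^2 a_3 \\
    &= r^{-2}\beta^2 a_2 - 3r^{-3}\beta^2\alpha a_3 \\[4pt]
b_1 &= r^{-1}\beta b_{p-2}^{(2)} - r^{-1}\alpha b_{p-1}^{(2)} \\
    &= r^{-1}\beta[a_{p-2} - r^{-1}\alpha b_{p-1}^{(1)}] - r^{-1}\alpha[r^{-1}\beta b_{p-1}^{(1)} - r^{-1}\alpha b_p^{(1)}] \\
    &= r^{-1}\beta[a_1 - r^{-1}\alpha(a_2 - r^{-1}\alpha a_3)] - r^{-1}\alpha[r^{-1}\beta(a_2 - r^{-1}\alpha a_3) - r^{-1}\alpha\, r^{-1}\beta\, a_3] \\
    &= r^{-1}\beta a_1 - r^{-2}\alpha\beta a_2 + r^{-3}\alpha^2\beta a_3 - r^{-2}\alpha\beta a_2 + r^{-3}\alpha^2\beta a_3 + r^{-3}\alpha^2\beta a_3 \\
    &= r^{-1}\beta a_1 - 2r^{-2}\alpha\beta a_2 + 3r^{-3}\alpha^2\beta a_3 \\[4pt]
b_0 &= a_{p-3} - r^{-1}\alpha b_{p-2}^{(2)} \\
    &= a_0 - r^{-1}\alpha[a_1 - r^{-1}\alpha b_{p-1}^{(1)}] \\
    &= a_0 - r^{-1}\alpha[a_1 - r^{-1}\alpha[a_2 - r^{-1}\alpha a_3]] \\
    &= a_0 - r^{-1}\alpha a_1 + r^{-2}\alpha^2 a_2 - r^{-3}\alpha^3 a_3
\end{aligned}
\]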
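
The recursion with the radius term can be sketched in a few lines and compared against the closed form of Eq (4). The code below rests on two assumptions that should be read as such: p is taken as the highest coefficient index (a_0 ... a_p), and the outer loop runs through n = p, which is what the worked expansion for b_0 ... b_3 requires. The function names are illustrative.

```python
from math import comb


def warp_recursion(a, alpha, r=1.0):
    """In-place recursion for the warped coefficients b[0..p].

    a     : predictor coefficients a[0..p]
    alpha : warping factor of the all-pass element
    r     : evaluation radius (r = 1 gives no bandwidth expansion)
    """
    p = len(a) - 1
    beta = 1.0 - alpha ** 2
    s = 1.0 / r                      # each step introduces one factor r**-1
    b = [0.0] * (p + 1)
    b[p] = a[p]
    for n in range(1, p + 1):        # assumption: loop runs through n = p
        b[p - n] = a[p - n] - s * alpha * b[p - n + 1]
        # Ascending k, so b[k + 1] still holds its value from the previous pass.
        for k in range(p - n + 1, p):
            b[k] = s * beta * b[k] - s * alpha * b[k + 1]
        b[p] = s * beta * b[p]
    return b


def closed_form(a, alpha, r=1.0):
    """b_k = sum_{n>=k} comb(n,k) * beta**k * (-alpha)**(n-k) * r**(-n) * a_n."""
    p = len(a) - 1
    beta = 1.0 - alpha ** 2
    return [
        sum(comb(n, k) * beta ** k * (-alpha) ** (n - k) * r ** (-n) * a[n]
            for n in range(k, p + 1))
        for k in range(p + 1)
    ]


a = [1.0, -0.9, 0.4, -0.1]           # hypothetical predictor coefficients
b_rec = warp_recursion(a, alpha=0.5, r=1.05)
b_ref = closed_form(a, alpha=0.5, r=1.05)
print(b_rec)
```

The recursion is Horner's rule applied to the polynomial in r^{-1}(beta*y - alpha), so it needs only on the order of p^2 multiply-adds and no stored matrix.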


APPENDIX  B 
WARPED  PHASE 


:-l 


Q-jw 


(,-jw 

g-j«> 

tan(— tw) 


w 


z — a 
1 — 


-1  < a < 1 


-jw 


a 


1 


- ae 
+ a 


1 + 

+ a)  {1  + ae+^^) 

{I  + ae~^'^){l  + ae+^^) 

_ e“-^“  + a + a + 

1 + ae+J^  + ae~^'"’  + 

_ cos  w + cos  w + 2a  + j (a^ 


C 

{a^  — 1)  sinu; 

(a^  + 1)  cos  w 2a 

(1  — a^)  sinw 


= arctan 


(q!^  + 1)  cos  w + 2a 


1)  sinu; 
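
As a numerical illustration of the warped phase, the frequency mapping can be read directly from the phase of the all-pass element rather than from a closed-form arctangent. The sketch below assumes the element (z^{-1} - alpha)/(1 - alpha z^{-1}) with -1 < alpha < 1, as written in Appendix A; the sign convention and the function name are assumptions of this example.

```python
import cmath
import math


def warped_frequency(w, alpha):
    """Warped frequency for 0 < w < pi, taken as minus the all-pass phase.

    Assumes the all-pass element (z**-1 - alpha) / (1 - alpha * z**-1)
    evaluated on the unit circle z = exp(j*w).
    """
    z_inv = cmath.exp(-1j * w)
    allpass = (z_inv - alpha) / (1.0 - alpha * z_inv)
    return -cmath.phase(allpass)


# With alpha = 0.5 the low frequencies are stretched toward higher warped values.
for w in (0.25, 0.5, 1.0, math.pi / 2, 2.5):
    print(f"w = {w:.3f} rad  ->  warped = {warped_frequency(w, 0.5):.3f} rad")
```

With alpha = 0 the mapping reduces to the identity, and for positive alpha the low-frequency region occupies a larger share of the warped axis, which is the behaviour the warped filter relies on.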




APPENDIX  C 
HMM  TRAINING 


The  HMM  is  a widely  used  statistical  method  for  characterizing  the  spectral  properties  and 
pattern  sequences  of  speech  frames.  The  HMM  is  based  on  the  ideas  of  statistical  signal 
modelling  and  is  a parametric  method  employed  for  the  pattern  recognition  of  speech.  The 
premise  of  an  HMM  is  that  the  speech  signal  can  be  characterized  as  a parametric  random 
process,  and  that  the  parameters  of  the  model  can  be  determined  in  a well  defined  sense. 
The  Markov  model  uses  a stochastic  process  of  state  transitions  to  generate  an  output 
sequence  or  model  an  input  sequence.  The  Hidden  Markov  Model  is  a doubly  stochastic 
process  in  which  a second  stochastic  process  contributes  a transmission  probability  which 
governs  the  generation  of  the  observed  time  series.  The  underlying  state  sequence  is  hidden 
by  the  secondary  stochastic  process.  Thus  there  are  two  main  components  to  an  HMM:  the 
finite  state  sequence  and  the  output  probability  distributions  (also  referred  to  as  emission 
probabilities).  The  Markov  chain  synthesizes  the  state  sequence,  and  the  state  transmission 
probabilities  together  create  the  observed  time  series.  The  observable  time  series  provides 
evidence  about  the  hidden  path  and  the  model  parameters  which  generated  that  path 
sequence. 

The  three  fundamental  problems  in  the  complete  model  specification  of  an  HMM  are: 

1. How to efficiently compute P(O | \lambda), the probability of the observation sequence O
given the model, provided \lambda and the observation sequence O = (o_1 o_2 ... o_T).

2. How to determine a state sequence q = (q_1 q_2 ... q_T) that best explains the given
observation sequence O and model \lambda.

3. How to adjust the model parameters \lambda to maximize P(O | \lambda), the probability of the
observation sequence given the model.

The notation defined for the model:

N = number of states, where the state at time t is q_t
M = number of distinct observation symbols
T = number of observed symbols
V = {v_1, ..., v_M} set of observation symbols for discrete HMM
\pi_i = initial probability of state i
A = {a_ij} where a_ij is the probability of being in state j at time t + 1 given that we were
in state i at time t
B = {b_j(k)} probability of observing symbol v_k given we are in state j
O_t = {o_t} observation symbol observed at time t
\lambda = (A, B, \pi) HMM model parameters




Probability  Evaluation 

This probability evaluation answers the question of how well a given model matches a given
observation sequence. This is the classification process which ultimately determines which
HMM best represents the spoken utterance.

\[
P(O \mid \lambda) = \sum_{q} P(O, q \mid \lambda) = \sum_{q} P(O \mid q, \lambda)\, P(q \mid \lambda) \qquad (5)
\]

The product in Equation (5) represents the joint probability that the Markov chain sequence
P(q | \lambda) and the state transmissions P(O | q, \lambda), given that state sequence, occur
simultaneously. The summation extends the joint probability of one particular path to the
full probability of all possible state paths. The calculation of P(O | \lambda) is most
efficiently performed by the forward and backward procedures.

Forward procedure:

\[
\alpha_t(j) = \Big[ \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij} \Big]\, b_j(o_t)
\]

Backward procedure:

\[
\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)
\]
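
The two procedures can be sketched for a discrete HMM as follows. The initial conditions used here (the forward variable at t = 1 equal to pi_i b_i(o_1), and the backward variable at t = T equal to 1) are the usual textbook choices and are an assumption of this example, since they are not stated above; the two-state model is a made-up toy.

```python
def forward(pi, A, B, obs):
    """Forward variables: alpha[t][j] for t = 0..T-1 (0-based time index)."""
    n_states = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(n_states)]]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([
            sum(prev[i] * A[i][j] for i in range(n_states)) * B[j][o]
            for j in range(n_states)
        ])
    return alpha


def backward(A, B, obs):
    """Backward variables: beta[t][i] for t = 0..T-1, with beta[T-1][i] = 1."""
    n_states = len(A)
    beta = [[1.0] * n_states]
    for o in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, [
            sum(A[i][j] * B[j][o] * nxt[j] for j in range(n_states))
            for i in range(n_states)
        ])
    return beta


def likelihood_forward(pi, A, B, obs):
    """P(O | lambda) as the sum of the final forward variables."""
    return sum(forward(pi, A, B, obs)[-1])


def likelihood_backward(pi, A, B, obs):
    """P(O | lambda) from the first backward variables."""
    beta = backward(A, B, obs)
    return sum(pi[i] * B[i][obs[0]] * beta[0][i] for i in range(len(pi)))


# Hypothetical two-state, two-symbol model for illustration only.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.5], [0.1, 0.9]]
obs = [0, 1, 1]
print(likelihood_forward(pi, A, B, obs), likelihood_backward(pi, A, B, obs))
```

Both routes give the same likelihood at a cost proportional to N^2 T, instead of summing over all N^T state paths. For long utterances the products underflow, so a practical implementation would scale the variables or work in the log domain.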

Optimal  State  Sequence 


Given an observation sequence and model, it is necessary to determine which of all possible
paths generates the greatest probability, P(q | O, \lambda). The best state sequence is the path
which maximizes P(q | O, \lambda), which is equivalent to maximizing P(q, O | \lambda). The Viterbi
algorithm is an inductive algorithm which preserves the maximally probable path at each
time instant. It is a dynamic minimization method that propagates the best path for each of
the N states and provides a final path probability for each state. The maximum probability
of the states, max_i P(q_i | O), is chosen as the optimal path sequence.
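
A minimal Viterbi sketch for a discrete HMM is given below. It works with raw probabilities for readability; taking negative logarithms turns the same recursion into a minimum-cost search. The toy model and the function name are assumptions of this example, not the recognizer used in this research.

```python
def viterbi(pi, A, B, obs):
    """Return (best_state_path, joint_probability) for a discrete HMM."""
    n_states = len(pi)
    # delta[j]: probability of the best path ending in state j at the current time.
    delta = [pi[i] * B[i][obs[0]] for i in range(n_states)]
    backptr = []
    for o in obs[1:]:
        step, new_delta = [], []
        for j in range(n_states):
            best_i = max(range(n_states), key=lambda i: delta[i] * A[i][j])
            step.append(best_i)
            new_delta.append(delta[best_i] * A[best_i][j] * B[j][o])
        backptr.append(step)
        delta = new_delta
    # Choose the most probable final state, then trace the stored pointers back.
    last = max(range(n_states), key=lambda i: delta[i])
    path = [last]
    for step in reversed(backptr):
        path.insert(0, step[path[0]])
    return path, delta[last]


# Hypothetical two-state, two-symbol model for illustration only.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.5], [0.1, 0.9]]
best_path, best_prob = viterbi(pi, A, B, [0, 1, 1])
print(best_path, best_prob)
```

Only one best predecessor per state is kept at each time step, which is what allows the final path to be recovered by backtracking from the most probable end state.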

Parameter  Estimation 


Given an observation sequence it is necessary to solve for the set of model parameters which
provide the best probability of actually observing that sequence. Optimal methods of
probability maximization are necessary since a closed form solution does not exist. The
Expectation Maximization technique (Baum-Welch) is a Maximum Likelihood estimate
which iteratively maximizes P(O | \lambda) from the observed time series. A description of the
process requires the definition of a variable, \xi_t(i, j), for the probability of being in state i
at time t and state j at time t + 1.

\[
\xi_t(i, j) = \frac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}
{\sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)} \qquad (6)
\]

With the aid of the forward and backward algorithms an interpretation of the inter-state
transitions and emission probabilities has been provided. The re-estimation parameters,
\lambda, are iteratively used to maximize the expected probability of the observation sequence
P(O | \lambda). They are:


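\[
\begin{aligned}
\bar{\pi}_i &= \text{expected frequency (number of times) in state } i \text{ at time } t = 1 \;=\; \gamma_1(i) \\[6pt]
\bar{a}_{ij} &= \frac{\text{expected number of transitions from state } i \text{ to state } j}
{\text{expected number of transitions from state } i}
= \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)} \\[6pt]
\bar{b}_j(k) &= \frac{\text{expected number of times in state } j \text{ and observing symbol } v_k}
{\text{expected number of times in state } j}
= \frac{\sum_{t=1,\; o_t = v_k}^{T-1} \gamma_t(j)}{\sum_{t=1}^{T-1} \gamma_t(j)}
\end{aligned}
\]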


The  EM  technique  provides  the  recovery  of  the  hidden  state  sequences  and  is  a special 
case  of  the  Maximum  Likelihood  technique. 
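
One pass of this re-estimation can be sketched as follows. The code is an illustrative sketch under stated assumptions, not the procedure used in this research: the function names forward_backward and reestimate are hypothetical, the model is a discrete HMM stored in plain Python lists with zero-based indices, the forward and backward variables are left unscaled (so the sketch is only suitable for short sequences), and every state is assumed to have nonzero expected occupancy.

```python
def forward_backward(obs, pi, A, B):
    """Unscaled forward (alpha) and backward (beta) variables of a discrete HMM."""
    N, T = len(pi), len(obs)
    alpha = [[0.0] * N for _ in range(T)]
    beta = [[1.0] * N for _ in range(T)]  # beta at the last frame is 1
    for i in range(N):
        alpha[0][i] = pi[i] * B[i][obs[0]]
    for t in range(1, T):
        for j in range(N):
            alpha[t][j] = sum(alpha[t - 1][i] * A[i][j] for i in range(N)) * B[j][obs[t]]
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in range(N))
    return alpha, beta


def reestimate(obs, pi, A, B):
    """One re-estimation pass; returns (new_pi, new_A, new_B, likelihood)."""
    N, K, T = len(pi), len(B[0]), len(obs)
    alpha, beta = forward_backward(obs, pi, A, B)
    likelihood = sum(alpha[T - 1])  # P(O | lambda), the denominator of xi

    # xi[t][i][j]: probability of state i at time t and state j at time t+1
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / likelihood
            for j in range(N)] for i in range(N)] for t in range(T - 1)]
    # gamma[t][i]: probability of state i at time t (xi summed over j)
    gamma = [[sum(xi[t][i]) for i in range(N)] for t in range(T - 1)]

    occupancy = [sum(gamma[t][i] for t in range(T - 1)) for i in range(N)]
    new_pi = list(gamma[0])
    new_A = [[sum(xi[t][i][j] for t in range(T - 1)) / occupancy[i]
              for j in range(N)] for i in range(N)]
    new_B = [[sum(gamma[t][j] for t in range(T - 1) if obs[t] == k) / occupancy[j]
              for k in range(K)] for j in range(N)]
    return new_pi, new_A, new_B, likelihood
```

Calling reestimate repeatedly, feeding each result back in as the next model, gives the iterative procedure; a practical version would scale alpha and beta at every frame or work in the log domain.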


HMM  Training 

The re-estimation parameters and EM algorithm provide a means by which the model
parameters can be adjusted so as to maximize the probability of observing the time series
signal. Alternatively, the Viterbi algorithm can be used to define the optimal state path
and to update the parameter estimates. The EM algorithm has stability and scaling issues,
is complex, and can converge to local maxima, so the Viterbi algorithm is preferred for
re-estimation and is the only implementation presented for this research. The training
of an HMM is conducted as follows:


1.  A speech  utterance  is  segmented  into  N frames.  Each  frame  is  processed  into  a feature 
vector  to  represent  one  observation  unit.  Collectively  the  concatenated  observation 
units  represent  the  observation  vector  for  the  speech  utterance. 

2.  Typically each word utterance is represented by one HMM. A training set of M
different utterances is used to train each HMM to be specific to the presented
word. A vocabulary of 10 words for a speech recognition system would require
10 independently trained HMMs. Thus HMMs are trained non-discriminatively.

3.  Linear  segmentation  is  performed  on  each  observation  vector  to  initialize  training 
state  distributions.  The  number  of  observation  units  in  each  observation  vector  is 
divided  by  the  number  of  states  to  determine  the  initial  occupancy  of  each  state.  The 
observation  units  are  then  linearly  assigned  in  order  to  the  corresponding  states  based 
on  their  occupancy. 

4.  The inter-state transition probabilities, A, are calculated from the state assignments
of each observation unit. A feed-forward (Bakis) model assumes no state skips and
starts in state 1. In this model all initial probabilities π_i are 0 except that of state one, π_1.

5.  The  emission  probabilities  are  calculated  from  the  given  observation  sequence,  initial 
and  inter-state  probabilities,  and  initial  state  assignments. 
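
Steps 3 and 4 can be illustrated with a short Python sketch. The helper names linear_segmentation and estimate_transitions are hypothetical, states are numbered from zero, and giving a state with no observed outgoing transitions a self-loop of probability one is an assumption of the example, not something specified in the text.

```python
def linear_segmentation(num_frames, num_states):
    """Assign frames to states in order, as evenly as possible (zero-based states)."""
    return [min(t * num_states // num_frames, num_states - 1)
            for t in range(num_frames)]


def estimate_transitions(state_sequences, num_states):
    """Estimate A as ratios of transition counts taken from the state assignments."""
    counts = [[0] * num_states for _ in range(num_states)]
    for seq in state_sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1

    A = []
    for i, row in enumerate(counts):
        total = sum(row)
        if total == 0:
            # Assumption: a state with no observed exits keeps a self-loop.
            A.append([1.0 if j == i else 0.0 for j in range(num_states)])
        else:
            A.append([c / total for c in row])
    return A


# Nine observation units divided among three states, three per state.
states = linear_segmentation(9, 3)
A = estimate_transitions([states], 3)
pi = [1.0, 0.0, 0.0]  # Bakis model: always start in the first state
```

With a left-to-right assignment such as this one, only the self-transition and the transition to the next state receive nonzero probability, which matches the no-skip assumption of the Bakis model.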

There are two HMM types for determining the emission probabilities. The discrete HMM
assumes a discrete observation symbol for each observation unit and generates a matrix
B of the discrete probability distributions. The continuous HMM provides a continuous
probability function and assumes a parametric distribution. The following sections present
a demonstration of the two methods. Once the A, B, and π parameters are determined, the
iterative Viterbi algorithm is presented in the final section to re-estimate the optimal path
sequences and complete training.

Discrete  Density  Models 


1.  A codebook for each word is created using the M utterances provided for each word
in training. Alternatively, one large codebook can be used to represent the entire
vocabulary of all words presented. This is a type of Vector Quantization. Codebook sizes
range from 256 to 1024 codewords. Each codeword in the codebook represents a closest fit
to the continuous observation elements. Each observation element is a vector of length D
(D being the dimension of the feature vector). A closest fit means closest under a
minimum distortion measure, such as the Euclidean distance (the Euclidean distance
between cepstral coefficients is equivalent to the rms log-spectral distortion distance).
Since our hearing sensitivity to loudness is on a log spectral scale, a corresponding
measure of dissimilarity is the cepstral distance.


[Figure: observations o1 through o9 and their assigned codewords in Codebook 1.]


2.  Each frame is assigned to a codeword in the codebook. Each codeword is represented
by an index (one to the number of codewords). Whichever codeword is closest to
the feature frame assigns its index value to that feature frame. The resulting sequence
of codeword indices serves as the observation vector. Thus, the observation vector is
not a concatenated sequence of continuous observation elements, but a sequence of
discrete representations for the word utterance: each frame is represented by a single
integer corresponding to the index of the codeword in the codebook. Multilabeling is a
technique to increase the effective training set size. Instead of assigning only the
closest codeword to the observation, additional training sequences from the 2nd best,
3rd, or Lth closest codewords are also considered. If the codebook is derived from the
K-means clustering algorithm, codewords close in distance will also be close in index value.


3.  The discrete observation vector is linearly segmented along with its new alleles
(additional training sequences created from multilabeling) into the available set of states.


[Figure: observations o1 through o9 and their multilabel sequences linearly segmented into
states 1, 2, and 3, with the codeword probabilities of each state as printed in the figure:]

            k1    k2    k3    k4    k5
state 1     3/6   3/6   0/6   0/6   0/6
state 2     0/6   0/6   4/6   1/6   1/6
state 3     0/6   1/6   0/6   3/6   1/6


4.  The discrete probability distribution B and the inter-state transition probabilities A are
determined. Matrix B is a set of discrete pdfs, one for each state. It contains a probability
vector for each state with a length corresponding to the number of codewords in the
codebook. The elements of each B vector correspond to the enumerated index values of
the codewords. The value at each element of the B vector represents the probability
of seeing that codeword, by index, in that state. For each element, it is simply the
ratio of the number of times that codeword is seen in that state to the total number of
codewords observed in that state. The inter-state transition probabilities are, similarly,
a set of calculated ratios.


5.  Now that a discrete probability function, B, and inter-state transition probability
matrix, A, have been estimated for each state, a maximum likelihood path through the
state lattice can be determined. For each observation vector, the EM or Viterbi
algorithm is used to determine the optimal state sequence. The optimal state sequence
provides the most likely estimate of the correct sequence for that particular
observation vector (maximum likelihood).


[Figure: state lattice over observations o1 through o9; the final state with the maximum
path probability is selected.]

6.  Wherever the optimal state sequence conflicts with the assigned state sequence, that
state value is reassigned to the optimal state value. This is done for all observation
vectors and for all training words. The process then returns to step 4. Once the
optimal state sequences correspond to the assigned state sequences, the training is
complete. Alternatively, training can be halted at a certain percentage of error in the
optimal state sequence.
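
The codeword assignment of step 2 and the ratio estimate of B in step 4 can be sketched as follows. This is a hedged illustration only: the function names nearest_codeword, quantize, and estimate_emissions are hypothetical, indices are zero-based, the codebook is assumed to be given (codebook training and multilabeling are not shown), and falling back to a uniform distribution for a state with no assigned frames is an assumption of the example.

```python
def nearest_codeword(frame, codebook):
    """Index of the codeword with minimum squared Euclidean distance to frame."""
    def dist2(codeword):
        return sum((x - y) ** 2 for x, y in zip(frame, codeword))
    return min(range(len(codebook)), key=lambda k: dist2(codebook[k]))


def quantize(frames, codebook):
    """Replace each D-length feature frame by the index of its closest codeword."""
    return [nearest_codeword(frame, codebook) for frame in frames]


def estimate_emissions(symbol_seqs, state_seqs, num_states, num_codewords):
    """Estimate B[j][k] as (times codeword k is seen in state j) / (frames in state j)."""
    counts = [[0] * num_codewords for _ in range(num_states)]
    for symbols, states in zip(symbol_seqs, state_seqs):
        for k, s in zip(symbols, states):
            counts[s][k] += 1

    B = []
    for row in counts:
        total = sum(row)
        # Assumption: an empty state falls back to a uniform distribution.
        B.append([c / total if total else 1.0 / num_codewords for c in row])
    return B


codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]  # toy codebook for illustration
frames = [[0.1, -0.1], [0.9, 1.2], [2.2, 1.9], [0.2, 0.0]]
symbols = quantize(frames, codebook)
B = estimate_emissions([symbols], [[0, 0, 1, 1]], num_states=2, num_codewords=3)
```

After each Viterbi pass reassigns the state sequences, the same counting step is simply repeated on the new assignments, which is what makes the training loop of steps 4 through 6 inexpensive.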


[Figure: observations o1 through o9 with the old state sequence and the reassigned state sequence.]


Continuous  Density  Models 

The difference between continuous density and discrete density HMMs is the representation
of the probability distribution functions. In a discrete model, the pdf is a discrete sequence
of codeword probabilities and resembles a histogram, a nonparametric characterization. In a
continuous model, the pdf is continuous and defined by the model probability parameters.


[Figure: observations o1 through o9 under a continuous density model.]


When a Gaussian function is used, the model parameters are the state means and
covariances. In a continuous model the B matrix is the set of Gaussian functions for each
state. In a continuous density single-Gaussian model only one Gaussian per state is used. A
continuous density Gaussian mixture provides more modeling flexibility due to
its multi-modal nature.

1.  The mean vectors (with the dimension of the feature vector) and covariance matrices
are calculated for each state. The mean vector for each state is simply the mean of
the feature vectors residing in that state. The covariance matrix for each state is
the covariance matrix of the feature vectors in that state. Diagonal covariances are
preferred for computational simplicity and because, with limited training data, the full
covariances will not be reliable estimates. The assumption of statistical independence
between cepstral coefficients supports the simplification as well. The state
mean and covariance provide the Gaussian pdf in Equation (7) for calculation of the
forward probabilities.

[Figure: observations grouped by state (state 1, state 2, and state 3 observations).]


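$$P(x) = \frac{1}{(2\pi)^{D/2}\, |\Sigma|^{1/2}} \exp\!\left(-\tfrac{1}{2}\,(x-\mu)^{\top} \Sigma^{-1} (x-\mu)\right) \tag{7}$$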


2.  The initial state distributions and Gaussian pdfs are used to determine an optimal
resequencing order. For each observation vector, the EM or Viterbi algorithm is
used to determine the optimal state sequence. The optimal state sequence provides
the most likely estimate of the correct sequence for that particular observation vector
(maximum likelihood) given the input sequence. Wherever the optimal state sequence
conflicts with the assigned state sequence, that state value is reassigned to the optimal
state value. This is done for all observation vectors of each word for that HMM.
The training process returns to step 4). Once the optimal state sequences correspond
to the assigned state sequences, the training is complete. Alternatively, training can be
halted at a certain percentage of optimal state sequence error. This is then performed
for all HMMs.
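
For the diagonal-covariance case, the per-state statistics and the Gaussian density can be sketched as below. The names state_statistics and log_gaussian are hypothetical, the density is returned in the log domain, and the small variance floor is an assumption added to keep the example numerically safe; none of these details are taken from the implementation used in this research.

```python
import math


def state_statistics(frames):
    """Mean vector and diagonal variances of the feature vectors assigned to one state."""
    n, d = len(frames), len(frames[0])
    mean = [sum(f[k] for f in frames) / n for k in range(d)]
    var = [sum((f[k] - mean[k]) ** 2 for f in frames) / n for k in range(d)]
    return mean, var


def log_gaussian(x, mean, var, floor=1e-6):
    """Log of a D-dimensional Gaussian pdf with diagonal covariance."""
    logp = 0.0
    for xk, mk, vk in zip(x, mean, var):
        vk = max(vk, floor)  # assumption: floor tiny variances from sparse training data
        logp += -0.5 * (math.log(2.0 * math.pi * vk) + (xk - mk) ** 2 / vk)
    return logp


# Two 2-dimensional feature vectors assigned to the same state.
mean, var = state_statistics([[1.0, 2.0], [3.0, 6.0]])
score = log_gaussian([2.0, 4.0], mean, var)
```

With a diagonal covariance the D-dimensional density factors into D one-dimensional Gaussians, so the log density is a sum over feature dimensions and no matrix inversion is needed.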


[Figure: state lattice for the continuous density model over observations o1 through o9, with
initial probabilities π[i], transition probabilities a_ij, and state emission probabilities
p_i[o_t] leading to the final state.]


REFERENCES 


[1]  A. Acero and R.M. Stern. Environmental robustness in automatic speech recognition.
ICASSP Conference, 2:849-852, 1990.

[2]  A. Agrawal and W.C. Len. Aspects of voiced speech parameters on the intelligibility
of Peterson Barney words. J. Acoust. Soc. Am, 57(1):217-222, 1975.

[3]  E.  Ambikairajah,  A.G.  Davis,  and  W.T.  Wong.  Auditory  masking  and  MPEG-1  audio 
compression.  Electronic  and  Communication  Engineering  Journal,  94:165-175,  1997. 

[4]  C.  Asavathiratham,  P.E.  Beckmann,  and  A.V.  Oppenheim.  Frequency  warping  in 
the  design  and  implementation  of  fixed-point  audio  equalizers.  IEEE  Workshop  on 
Applications  of  Signal  Proc.  and  Acous.,  New  York:55-58,  October  17-20  1999. 

[5]  M.  Boillot.  Performance  of  a cepstral  based  speech  recognition  algorithm.  Motorola 
Speech  and  Audio  Quality  Summit  Proceedings,  pages  1-6,  1999. 

[6]  M. Boillot, S. Shah, and P. Doran. An algorithm testing environment for speech
recognition. Motorola Systems Symposium: Broward County Convention Center, pages
8-14, 1999.

[7]  M.  Boillot,  S.  Shah,  and  P.  Doran.  A low  memory  speech  recognition  algorithm  in 
the  i600.  Motorola  Internal  Confidential  report,  pages  1-6,  1999. 

[8]  S.F.  Boll.  Suppression  of  acoustic  noise  in  speech  using  spectral  subtraction.  IEEE 
Transactions  on  Acoustics,  Speech,  and  Signal  Processing,  27:113-120,  April  1979. 

[9]  M.C. Botte, S. Charron, and H. Bouayad. Temporary threshold and loudness shifts:
Frequency patterns and correlation. J. Acoust. Soc. Am, 93 (3):1524-1534, 1998.

[10]  Karl  Brandenburg.  ISO-MPEG-1  Audio:  A generic  standard  for  coding  high  quality 
digital  audio.  Electronic  and  Communication  Engineering  Journal,  94:165-175,  1997. 

[11]  A.T. Cacace and R.H. Margolis. On the loudness of complex stimuli and its
relationship to cochlear excitation. J. Acoust. Soc. Am, 78 (5):1568-1573, 1985.

[12]  O.  Cappe.  Elimination  of  the  musical  noise  phenomenon  with  the  Ephraim  and  Malah 
noise  suppressor.  IEEE  Transactions  on  Speech  and  Audio  Processing,  2(2):345-349, 
1994. 

[13]  S.  Celebi  and  J.C.  Principe.  Parametric  least  squares  approximation  using  Gamma 
bases.  Technical  Report  No.  EDICS  SP  2.3.1,  pages  1-21,  1994. 

[14]  Samel  Celebi.  Representation  of  locally  stationary  signals  using  low-pass  moments. 
University  of  Florida,  Ph.D.  Thesis,  1995. 




[15]  J.H. Chen and A. Gersho. Adaptive postfiltering for quality enhancement of coded
speech. IEEE Trans. on Speech and Audio Proc., 3 (1):59-71, 1995.

[16]  J.H. Chen, N. Jayant, and R.V. Cox. Improving the performance of the 16-KBs
LD-CELP speech coder. IEEE Conference, pages I69-I72, 1992.

[17]  Y.M. Cheng and D. O’Shaughnessy. Speech enhancement based conceptually on
auditory evidence. IEEE Transactions on Signal Processing, 39:1943-1954, 1991.

[18]  V. Colotte and Y. Laprie. Automatic enhancement of speech intelligibility. ICASSP,
pages 1057-1060, 2000.

[19]  B.  de  Vries  and  J.C.  Principe.  The  Gamma  Model-A  new  neural  model  for  temporal 
processing.  Neural  Networks,  5:565-576,  1992. 

[20]  Bert  de  Vries.  Temporal  processing  with  neural  networks-The  development  of  the 
Gamma  Model.  University  of  Florida,  Ph.D.  Thesis,  1991. 

[21]  John R. Deller, John G. Proakis, and John H.L. Hansen. Discrete-Time Processing
of Speech Signals. Maxwell Macmillan, Toronto, Canada, 1993.

[22]  Paul  Denisowski.  How  does  it  sound?  IEEE  Spectrum,  pages  60-64,  2001. 

[23]  S. Dimolitsas, F. Corcoran, and M.R. Baraniecki. Transmission quality of North
American cellular, personal communications, and public switched telephone networks.
IEEE Trans. on Vehicular Technology, 43(2):245-251, May 1994.

[24]  S. Dimolitsas, F. Corcoran, and M.R. Baraniecki. Dependence of opinion scores of
listening sets used in degradation category based assessments. IEEE Trans. on Speech
and Audio Proc., 3(5):421-424, Sept. 1995.

[25]  J.R. Dubno and M.F. Dorman. Effects of spectral flattening on vowel identification.
J. Acoust. Soc. Am, 82 (5):1503-1511, Nov. 1987.

[26]  B. Edler and G. Schuller. Audio coding using a psychoacoustic pre- and post-filter.
IEEE, pages 881-884, 2000.

[27]  Y.  Ephraim  and  D.  Malah.  Speech  enhancement  using  a minimum  mean-square  error 
short-time  spectral  amplitude  estimator.  IEEE  Transactions  on  Acoustics,  Speech, 
and  Signal  Processing,  32:1109-1121,  1984. 

[28]  G. Evangelista. The short-time Laguerre transform: A new method for the real-time
frequency warping of sounds. Proc. of ICMC 2000, Berlin, Germany:380-383, Aug
2000.

[29]  G. Evangelista. Real-time varying frequency warping via short-time Laguerre
transform. Proc. of the Conference of Digital Audio Effects, (DAFX-00 G-6), Verona,
Italy:1-6, Dec 7, 2000.

[30]  G.C.M.  Fant.  Acoustic  Theory  of  Speech  Production.  Mouton  and  Company,  New 
York,  1960. 

[31]  H.  Fletcher  and  W.J. Munson.  Loudness,  its  definition,  measurement,  and  calculation. 
J.  Acoust.  Soc.  Am,  5:82-108,  1933. 




[32]  Harvey Fletcher. Speech and Hearing in Communication. Acoustical Society of
America (reprint), Woodbury N.Y., 1953.

[33]  C.R. Galand, J.E. Menez, and M.M. Rosso. Adaptive code excited linear prediction.
IEEE Transactions on Signal Processing, 40 no 6:1317-1326, June 1992.

[34]  I.A. Gerson and M.A. Jasiuk. Techniques for improving the performance of
CELP-Type speech coders. IEEE Journal on Selected Areas of Communications,
10(5):858-865, 1992.

[35]  O. Ghitza. Adequacy of auditory models to predict human internal representation of
speech sounds. J. Acoust. Soc. Am., 93(4):2160-2171, 1993.

[36]  O. Ghitza. Processing of spoken CVCs in the auditory periphery I: Psychophysics.
J. Acoust. Soc. Am, 97(2):2507-2516, 1993.

[37]  O. Ghitza. Auditory models and human performance in tasks related to speech coding
and speech recognition. IEEE Trans on Speech and Audio Processing, 2(1), 1994.

[38]  O. Ghitza. On the perceptual distance between speech segments. J. Acoust. Soc. Am,
101(1):522-529, 1996.

[39]  O. Ghitza and M.M. Sondhi. On the perceptual difference between speech segments.
J. Acoust. Soc. Am., 101(1):522-529, 1997.

[40]  B.R.  Glasberg  and  B.C.  Moore.  Derivation  of  auditory  filter  shapes  from  notched 
noise  data.  Hearing  Research,  47:103-138,  1990. 

[41]  B.  Gold  and  N.  Morgan.  Speech  and  Audio  Signal  Processing.  John  Wiley  & Sons, 
Inc.  New  York,  2000. 

[42]  M.  Goldstein.  Classification  of  methods  used  for  assessment  of  text-to-speech  synthesis 
according  to  demands  placed  on  the  listener.  Speech  Communication,  16:225-244, 
1995. 

[43]  M. Gordon and W.E. O’Neil. Temporal processing across frequency channels by
fm selective auditory neurons can account for fm rate selectivity. Hearing Research,
122:97-108, 1998.

[44]  D.M.  Green.  Profile  Analysis.  Oxford  Science  Publication,  1988. 

[45]  D.M. Green and T.G. Forrest. Temporal gaps in noise and sinusoids. J. Acoust. Soc.
Am, 86:961-970, 1989.

[46]  C. Guo, H. Xiulin, Z. Yunyu, and Z. Ting. A modified Itakura speech distortion
measure. IEEE Trans on Speech and Audio Processing, 8(2):105-114, March 2000.

[47]  J.H. Hansen and M.A. Clements. Constrained iterative speech enhancement with
application to speech recognition. IEEE Transactions on Signal Processing,
39:795-805, April 1991.

[48]  A. Harma and U.K. Laine. A comparison of warped and conventional linear predictive
coding. IEEE Transactions on speech and audio processing, 9 no.5:579-588, 2001.




[49]  Aki Harma. Implementation of recursive filters having delay free loops. Proc IEEE
Int. Conf. Acoustics, Speech, Signal Processing, 3:1261-1264, May 1998.

[50]  William  Hartmann.  Signals,  Sound,  and  Sensation.  Springer,  New  York,  1998. 

[51]  M. Hauenstein. A computationally efficient algorithm for calculating loudness
patterns of narrow band speech. Acoustic., Speech, and Signal Proc., ICASSP-97 IEEE
International Conference, 2:1311-1314, 1997.

[52]  V. Hazan and A. Simpson. Enhancing information rich regions of natural VCV and
sentence materials presented in noise. IEEE Spoken Language, 1:161-164, 1996.

[53]  V. Hazan and A. Simpson. The effect of cue enhancement on the intelligibility of
nonsense word and sentence materials presented in noise. Speech Communication,
24:211-226, 1998.

[54]  R. Hellman. Why can a decrease in dB(A) produce an increase in loudness? J.
Acoust. Soc. Am, 82(5):1700-1705, 1987.

[55]  R. Hellman, A. Miskiewicz, and B. Scharf. Loudness adaptation and excitation
patterns: Effects of frequency and level. J. Acoust. Soc. Am, 101 (4):2176-2185, 1997.

[56]  H. Hermansky. Perceptual linear predictive (PLP) analysis of speech. J. Acoust. Soc.
Am, 87(4):1738-1752, 1990.

[57]  H. Hermansky and N. Morgan. RASTA Processing of Speech. IEEE Transactions
on Speech and Audio Processing, 2(4), 1994.

[58]  H. Hermansky, N. Morgan, and H.G. Hirsch. Recognition of speech in additive and
convolutional noise based on RASTA spectral processing. IEEE, pages H-83 H-86,
1993.

[59]  J.  Hillenbrand  and  M.  J.  Clark.  Effects  of  consonant  environment  on  vowel  formant 
patterns.  J.  Acoust.  Soc.  Am,  109(2):748-763,  2001. 

[60]  J. Hillenbrand, L.A. Getty, M. Clark, and K. Wheeler. Acoustic characteristics of
American English vowels. J. Acoust. Soc. Am, 97 (5):3099-3111, 1995.

[61]  J.B.  Hooper  and  M.J. Russell.  Objective  quality  of  a voice  over  internet  protocol 
system.  IEEE  Electronic  letters,  pages  1900-1902,  Oct.  2000. 

[62]  T.  Houtgast.  Auditory  analysis  of  vowel-like  sounds.  Acustica,  31:320-324,  1974a. 

[63]  Mohammed  Ismail.  Adaptation  of  generalized  feedforward  filters  with  applications  to 
speech.  University  of  Florida,  Ph.D.  Thesis,  1991. 

[64]  ISO-226.  Acoustics  - normal  equal  loudness  contours.  ISO  Geneva,  Switzerland,  1987. 

[65]  ISO-532.  Acoustics  - method  for  calculating  loudness  level.  ISO  Geneva,  Switzerland, 
1975. 

[66]  ISO-532.  BASIC  Program  for  calculating  the  loudness  of  sounds  from  their  1/3-Oct 
band  spectra  according  to  ISO  532  B.  Acustica,  Letters  to  the  editors,  55:63-67,  1984. 




[67]  M. Ito, J. Tsuchida, and M. Yano. On the effectiveness of whole spectral shape
for vowel perception. J. Acoust. Soc. Am, 110(2):1141-1150, 2001.

[68]  C.R. Jankowski, H.D. Vo, and R.P. Lippmann. A comparison of signal processing
front ends for automatic word recognition. IEEE Transactions on Speech and Audio
Processing, 3(4):285-293, July 1995.

[69]  J.  Johnston.  Elimination  of  perceptual  entropy  using  noise  masking  criteria.  IEEE, 
A1.9:2524-2527,  1988. 

[70]  J.D.  Johnston.  Transform  coding  of  audio  signal  using  perceptual  noise  criteria.  IEEE 
J.  Select  Areas  Commun.,  6:314-323,  1988. 

[71]  B-H.  Juang.  Speech  recognition  in  adverse  environments.  Computer  Speech  and 
Language,  5:275-294,  1991. 

[72]  J. Junqua, H. Wakita, and H. Hermansky. Evaluation and optimization of
perceptually based ASR front end. IEEE Transactions on Speech and Audio Processing,
1(1):39-47, 1993.

[73]  M. Karjalainen, T. Altosaar, and M. Vainio. Speech synthesis using warped linear
prediction and neural networks. Proc. IEEE, pages 877-880, 1998.

[74]  M. Karjalainen, A. Harma, and U.K. Laine. Realizable warped IIR filters and their
properties. Proc IEEE ICASSP-97, Munich:2205-2208, 1997.

[75]  M. Karjalainen, A. Harma, U.K. Laine, and Jyri Huopaniemi. Warped filters and
their audio applications. IEEE workshop on ASPAA 97, Mohonk, N.Y., 1997.

[76]  D. Klatt and L. Klatt. Analysis, synthesis, and perception of voice quality variations
among female and male listeners. J. Acoust. Soc. Am, 87(2):820-857, 1990.

[77]  K.N. Stevens. The Potential role of property detectors in the perception of consonants.
Academic Press, 1970.

[78]  P. Kroon and B.S. Atal. Quantization procedures for 4.8kbps CELP coders. Proc.
IEEE ICASSP, pages 1650-1654, April 1987.

[79]  L-M.  Lee,  J-K.  Chen,  and  H-C.  Wang.  Nonlinear  cepstral  equalization  method  for 
noisy  speech  recognition.  IEEE  Proc-Vis.  Image  Signal  Processing,  141(6):397-402, 
Dec  1994. 

[80]  M.R. Leek and M.F. Dorman. Minimum spectral contrast for vowel identification by
normal hearing and hearing impaired listeners. J. Acoust. Soc. Am, 81(1):148-154,
1987.

[81]  J.S. Lim and A.V. Oppenheim. Enhancement and bandwidth compression of noisy
speech. Proc. IEEE, 67:1586-1604, 1979.

[82]  P. Lockwood and J. Boudy. Experiments with a nonlinear spectral subtractor (NSS),
Hidden Markov models and projection, for robust recognition in cars. Speech
Communications, 11:215-228, 1992.




[83]  J. Makhoul. Methods for nonlinear distortion of speech signals. Proc ICASSP 76,
pages 87-90, 1976.

[84]  D. Mansour and B.H. Juang. A family of distortion measures based upon the
projection operation for robust speech recognition. IEEE Transactions on Acoustics,
Speech, and Signal Processing, 37 (11):1659-1671, Nov 1989.

[85]  J.D. Markel and A.H. Gray. Linear Prediction of Speech. Springer-Verlag, New York,
1976.

[86]  L.E. Marks. Recalibrating the perception of loudness: Interaural transfer. J. Acoust.
Soc. Am, 100 (1):473-480, 1996.

[87]  S.J.  McAdams.  Auditory  continuity  and  loudness  computation.  J.  Acoust.  Soc.  Am, 
103  (3):1580-1591,  1998. 

[88]  S.L.  McCabe.  A model  of  auditory  streaming.  J.  Acoust.  Soc.  Am,  101  (3):1611-1620, 
1997. 

[89]  S. McCandless. An algorithm for automatic formant extraction using linear predictive
spectra. IEEE Trans. on Acoustic., Speech, and Signal Proc., ASSP-22:135-141, 1974.

[90]  G.W.  McNally.  Dynamic  range  control  of  audio  signals.  J.  Audio.  Eng.  Soc., 
32(5):316-327,  May  1984. 

[91]  D.  Messerschmitt.  A class  of  generalized  lattice  filters.  IEEE  trans.  on  Acoust., 
Speech,  and  Signal  Proc.,  28:198-204,  April  1980. 

[92]  B.C.  Moore  and  B.R.  Glasberg.  Auditory  filter  shapes  derived  in  simultaneous  and 
forward  masking.  J.  Acoust.  Soc.  America.,  70:1003-1014,  1981. 

[93]  B.C.  Moore  and  B.R.  Glasberg.  Suggested  formula  for  calculating  auditory-filter 
bandwidth  and  excitation  patterns.  J.  Acoust.  Soc.  America.,  74:750-753,  1983. 

[94]  B.C. Moore and B.R. Glasberg. Formulae describing frequency selectivity as a
function of frequency and level and their use in calculating excitation patterns. Hearing
Research, 28:209-225, 1987.

[95]  B.C.  Moore,  B.R.  Glasberg,  and  T.  Baer.  Revision  of  Zwicker’s  loudness  model. 
Acustica,  82:335-445,  1996. 

[96]  B.C.  Moore,  B.R.  Glasberg,  and  T.  Baer.  A model  for  the  prediction  of  thresholds, 
loudness,  and  partial  loudness.  J.  Aud.  Eng.  Soc.,  45(4):224-239,  April  1997. 

[97]  B.C. Moore, B.R. Glasberg, and T. Baer. Masking patterns for synthetic vowels in
simultaneous and forward masking. J. Acoust. Soc. Am, 73(3):906-917, March 1983.

[98]  R.P. Morse and E. Evans. Preferential and non-preferential transmission of formant
information by an analogue cochlear implant using noise: The role of the nerve
threshold. Hearing Research, 133:120-132, 1999.

[99]  Hannes  Musch.  Fletcher  and  Galt’s  method  for  calculating  the  articulation  index. 
Institute  for  Hearing,  Speech,  and  Language,  Northeastern  University:  1-6,  1999. 




[100]  N.  Nocerino,  F.K.  Soong,  L.R.  Rabiner,  and  D.H.  Klatt.  Comparative  study  of  several 
distortion  measures  for  speech  recognition.  Speech  Communication,  4:317-331,  1985. 

[101]  I.M. Noordhoek and R. Drullman. Effect of reducing temporal intensity modulations
on sentence intelligibility. J. Acoust. Soc. Am, 101(1):498-502, 1997.

[102]  A.V. Oppenheim, D.H. Johnston, and K. Steiglitz. Computation of spectra with
unequal frequency resolution using the fast Fourier transform. Proc. IEEE, 59, 1971.

[103]  A.V.  Oppenheim  and  R.W.  Schafer.  Discrete-Time  Signal  Processing.  Prentice-Hall, 
Englewood  Cliffs,  NJ,  1989. 

[104]  A.V.  Oppenheim,  E.  Weinstein,  K.C.  Zangi,  M.  Feder,  and  D.  Gauger.  Single  sensor 
active  noise  cancellation.  IEEE  Transactions  on  Speech  and  Audio  Processing,  2(2), 
April  1994. 

[105]  Davis Pan. A tutorial on MPEG/audio compression. IEEE Multimedia, pages 60-74,
1995.

[106]  R. Patterson. Auditory filter shapes derived with noise. J. Acoust. Soc. Am,
74:640-654, 1976.

[107]  R. Patterson, J. Nimmo-Smith, and P. Rice. The auditory filterbank. MRC-APU
report 2341, 1991.

[108]  R.W. Peters, B.C. Moore, and T. Baer. Speech reception thresholds in noise with
and without spectral and temporal dips for hearing impaired and normally hearing
people. J. Acoust. Soc. Am, 103 (1):577-587, 1998.

[109]  Martin Pflueger, Robert Hoeldrich, and William Riedler. A nonlinear model of the
peripheral auditory system. IEM Report, pages 1-10, Feb 1998.

[110]  J.W. Picone. Signal modeling techniques in speech recognition. Proceedings of the
IEEE, 81 (9):1215-1247, 1993.

[111]  C.J.  Plack.  Loudness  enhancement  and  intensity  discrimination  under  forward  and 
backward  masking.  J.  Acoust.  Soc.  Am,  100  (2):1024-1030,  1996. 

[112]  J.C. Principe. Handbook of neural network signal processing. CRC Press, Chapter 6,
2002.

[113]  J.C. Principe, B. de Vries, and P.G. de Oliveira. The Gamma Filter-A new class of
adaptive IIR filters with restricted feedback. IEEE trans. on Signal Proc.,
41(2):649-656, 1993.

[114]  L.R. Rabiner and B.H. Juang. Fundamentals of Speech Recognition. Prentice-Hall,
Englewood Cliffs, N.J., 1993.

[115]  M.G. Rahim, B-H. Juang, W. Chou, and E. Buhrke. Signal conditioning techniques
for robust speech recognition. IEEE Signal Processing Letters, 3(4):107-109, April
1996.




[116]  R. Salami, C. Laflamme, and J. Adoul. Design and description of CS-ACELP: A toll
quality 8 kbs speech coder. IEEE Trans. on Speech and Audio Proc., 6(2):116-130,
1998.

[117]  B. Scharf. Hearing Research and Theory. Academic Press vol. 2, New York, 1983.

[118]  R.S. Schlauch. A cognitive influence on the loudness of tones that change continuously
in level. J. Acoust. Soc. Am, 92 (2):758-765, 1992.

[119]  J. G. Schmidt and J.C. Rutledge. Multichannel dynamic range compression for music
signals. IEEE, pages 1013-1016, 1996.

[120]  M.R. Schroeder, B.S. Atal, and J.L. Hall. Optimizing digital speech coders by
exploiting masking properties of the human ear. J. Acoust. Soc. Amer., 66:1647-1652,
Dec 1979.

[121]  S.A. Shamma. Speech processing in the auditory system I: The representation of
speech sounds in the response of the auditory nerve. J. Acoust. Soc. Am,
78 (5):1612-1621, 1985.

[122]  H.  Sheikhzadeh,  H.  Sameti,  L.  Deng,  and  R.  Brennan.  Comparative  performance  of 
spectral  subtraction  and  hmm  based  speech  enhancement  strategies  with  application 
to  hearing  aid  design,  unknown,  1:13-16,  199. 

[123]  J.  Skoglund  and  W.B.  Kleijn.  On  time-frequency  masking  in  voiced  speech.  IEEE
trans.  on  Speech  and  Audio  Proc.,  8(4):361-369,  July  2000.

[124]  Malcolm  Slaney.  An  efficient  implementation  of  the  Patterson-Holdsworth  auditory 
filter  bank.  Apple  Computer  Technical  Report  35,  1993. 

[125]  F.K.  Soong  and  M.M.  Sondhi.  A frequency  weighted  Itakura  spectral  distortion 
measure  and  its  application  to  speech  recognition  in  noise.  IEEE  Transactions  on 
Acoustics,  Speech,  and  Signal  Processing,  36(1):41-48,  Jan  1988.

[126]  H.J.  Steeneken  and  T.  Houtgast.  A  physical  method  for  measuring  speech
transmission  quality.  J.  Acoust.  Soc.  Am,  67(1):318-326,  Jan.  1980.

[127]  K.  Steiglitz.  A note  on  variable  recursive  digital  filters.  Proc.  IEEE  ASSP,  28(1):111- 
112,  1980. 

[128]  S.  Stevens.  The  direct  estimation  of  sensory  magnitudes:  loudness.  American  Journal 
of  Psychology,  69:1-25,  1956. 

[129]  B.  Strope  and  A.  Alwan.  Modeling  auditory  perception  to  improve  speech  recognition.
IEEE,  1058-6393:1056-1060,  1998. 

[130]  H.W.  Strube.  Linear  prediction  on  a warped  frequency  scale.  J.  Acoust.  Soc.  Amer., 
68  no  4:1071-1076,  1980. 

[131]  M.H.  Sunwoo.  Real-time  implementation  of  the  VSELP  on  a 16-bit  DSP  chip.  IEEE 
Trans.  on  Consumer  Electronics,  37(4),  1991.




[132]  D.T.  Van  Tasell,  D.A.  Fabry,  and  L.M.  Thibodeau.  Vowel  identification  and  vowel 
masking  patterns  of  hearing  impaired  subjects.  J.  Acoust.  Soc.  Am,  81(5):1586-1597,
1987. 

[133]  F.  Taylor  and  G.  Zelniker.  Advanced  digital  signal  processing.  Marcel  Dekker,  Inc. 
New  York,  1994. 

[134]  H.  Traunmüller.  Analytic  expressions  for  the  tonotopic  sensory  scale.  J.  Acoust.  Soc.
Am,  88:97-100,  1990.

[135]  D.  Tsoukalas,  J.  Mourjopoulos,  and  G.  Kokkinakis.  Audio  noise  cancellation  using  a 
subjective  signal  representation.  Conv.  IEEE  DSP-97,  pages  613-616,  1997. 

[136]  D.  Tsoukalas,  J.  Mourjopoulos,  and  G.  Kokkinakis.  Speech  enhancement  based  on 
audible  noise  suppression.  IEEE  Transactions  on  Acoustics,  Speech,  and  Signal
Processing,  5(6):497-514,  Nov  1997.

[137]  D.  Tsoukalas,  M.  Paraskevas,  and  J.N.  Mourjopoulos.  Speech  enhancement  using
psycho-acoustic  criteria.  Proc  IEEE  ICASSP,  pages  359-361,  April  1993. 

[138]  C.W.  Turner,  S.J.  Smith,  P.  Aldridge,  and  S.  Stewart.  Formant  transition
duration  and  speech  recognition  in  normal  and  hearing  impaired  listeners.  J.  Acoust.  Soc.
Am,  101(5):2822-2825,  1997.

[139]  O.  Viikki  and  K.  Laurila.  Cepstral  domain  segmental  feature  vector  normalization 
for  noise  robust  speech  recognition.  Speech  Communication,  25:133-147,  1998.

[140]  N.  Virag.  Single  channel  speech  enhancement  based  on  masking  properties  of  the 
human  auditory  system.  IEEE  Trans.  on  Speech  and  Audio  Proc.,  7(2),  March  1999.

[141]  N.  Virag.  Speech  enhancement  based  on  masking  properties  of  the  auditory  system. 
Proc  IEEE  ICASSP,  pages  796-799,  May  1995. 

[142]  W.D.  Voiers.  Ch.  34  Diagnostic  Evaluation  of  Speech  Intelligibility.  Dowden,
Hutchinson,  and  Ross,  Inc.  New  York,  1977.

[143]  S.  Voran.  Objective  estimation  of  perceived  speech  quality-part  I:  Development  of  the
measuring  normalizing  block  technique.  IEEE  trans.  on  Acoust.,  Speech,  and  Signal 
Proc.,  7(4):371-382,  July  1999. 

[144]  S.  Wang,  A.  Sekey,  and  A.  Gersho.  An  objective  measure  for  predicting
subjective  quality  of  speech  coders.  IEEE  Journal  on  Selected  Areas  of  Communication,
10(5):819-829,  June  1992. 

[145]  E.D.  Young  and  M.B.  Sachs.  Representation  of  steady  state  vowels  in  the  temporal
aspects  of  the  discharge  patterns  of  populations  of  auditory-nerve  fibers.  J.  Acoust.  Soc.
Am,  66:1381-1403,  1979. 

[146]  F.G.  Zeng  and  C.W.  Turner.  Recovery  from  prior  stimulation  II:  Effect  upon  intensity
discrimination.  Hearing  Research,  55:223-230,  1991.

[147]  C.  Zhang  and  F.G.  Zeng.  Loudness  of  dynamic  stimuli  in  acoustic  and  electric  hearing. 
J.  Acoust.  Soc.  Am,  102(5):2925-2932,  1997.




[148]  E.  Zwicker.  Procedure  for  calculating  loudness  of  temporally  variable  sounds.
J.  Acoust.  Soc.  Am,  62  (3):675-682,  1977.

[149]  E.  Zwicker  and  H.  Fastl.  Psychoacoustics.  Springer  Series,  Berlin,  1998.

[150]  E.  Zwicker  and  E.  Terhardt.  Analytic  expressions  for  critical  band  rate  and  critical 
bandwidth  as  a  function  of  frequency.  J.  Acoust.  Soc.  Am,  68:1523-1525,  1980.


BIOGRAPHICAL  SKETCH 


I was  born  just  before  my  twin  sister  Laura  on  April  28th,  1971,  in  Pensacola,  Florida 
to  Mona  and  Michel  Boillot.  I have  French,  British,  and  American  citizenship.  I obtained 
a BSEE  from  the  University  of  Florida  in  1994.  I continued  into  the  graduate  program  of 
electrical  engineering  in  the  areas  of  biomedicine  and  neuroscience.  I worked  as  a research 
assistant  at  the  Veterans  Affairs  Medical  Center  under  neurosurgeon  Steven  N.  Roper.  This 
was  an  invaluable  experience. 

During  these  two  years  I completed  the  MSEE  degree  under  Dr.  John  Harris,  with  a 
thesis  titled  “Data  Acquisition  and  Visualization  for  Electrophysiology  Studies  In-Vitro”  in
1997.  I continued  into  the  Ph.D.  program  and  worked  at  the  Electronic  Communications 
Facility  on  target  acquisition  and  recognition  software  systems.  I  later  joined  Motorola  in
June  of  1998  as  a DSP  software  engineer  for  the  iDEN  cellular  phone  division.  During 
this  time  I began  research  on  improving  real-time  performance  for  speech  processing  and 
recognition  systems.  I was  co-principal  investigator  of  a two  year  speech  project  graciously 
funded  by  the  iDEN  division  of  Motorola.  I have  been  employed  by  Motorola  through  this 
research  and  through  the  completion  of  this  final  degree.  I will  continue  research  in  third 
generation  multimedia  platforms  and  applications  for  wireless  devices. 




I  certify  that  I  have  read  this  study  and  that  in  my  opinion  it  conforms  to
acceptable  standards  of  scholarly  presentation  and  is  fully  adequate,  in  scope  and  quality,  as  a
dissertation  for  the  degree  of  Doctor  of  Philosophy. 


John  G.  Harris,  Chairman
Associate  Professor  of  Electrical  and 
Computer  Engineering 


I  certify  that  I  have  read  this  study  and  that  in  my  opinion  it  conforms  to
acceptable  standards  of  scholarly  presentation  and  is  fully  adequate,  in  scope  and  quality,  as  a
dissertation  for  the  degree  of  Doctor  of  Philosophy.


Jose  C.  Principe
Professor  of  Electrical  and  Computer
Engineering


I  certify  that  I  have  read  this  study  and  that  in  my  opinion  it  conforms  to
acceptable  standards  of  scholarly  presentation  and  is  fully  adequate,  in  scope  and  quality,  as  a
dissertation  for  the  degree  of  Doctor  of  Philosophy.


Professor  of  Electrical  and  Computer 
Engineering 


I  certify  that  I  have  read  this  study  and  that  in  my  opinion  it  conforms  to
acceptable  standards  of  scholarly  presentation  and  is  fully  adequate,  in  scope  and  quality,  as  a
dissertation  for  the  degree  of  Doctor  of  Philosophy. 


Purvis  Bedenbaugh 

Associate  Professor  of  Neuroscience 


This  dissertation  was  submitted  to  the  Graduate  Faculty  of  the  College  of  Engineering 
and  to  the  Graduate  School  and  was  accepted  as  partial  fulfillment  of  the  requirements  for 
the  degree  of  Doctor  of  Philosophy. 


MAY  2002 


Pramod  Khargonekar 
Dean,  College  of  Engineering 


Winfred  Phillips 
Dean,  Graduate  School 


